Compressed Regression 



Shuheng Zhou*   John Lafferty*†   Larry Wasserman‡† 

* Computer Science Department 
† Machine Learning Department 
‡ Department of Statistics 

Carnegie Mellon University 
Pittsburgh, PA 15213 

February 1, 2008 
Abstract 

Recent research has studied the role of sparsity in high dimensional regression and 
signal reconstruction, establishing theoretical limits for recovering sparse models from 
sparse data. This line of work shows that $\ell_1$-regularized least squares regression can 
accurately estimate a sparse linear model from n noisy examples in p dimensions, even 
if p is much larger than n. In this paper we study a variant of this problem where the 
original n input variables are compressed by a random linear transformation to $m \ll n$ 
examples in p dimensions, and establish conditions under which a sparse linear model 
can be successfully recovered from the compressed data. A primary motivation for 
this compression procedure is to anonymize the data and preserve privacy by reveal- 
ing little information about the original data. We characterize the number of random 
projections that are required for $\ell_1$-regularized compressed regression to identify the 
nonzero coefficients in the true model with probability approaching one, a property 
called "sparsistence." In addition, we show that $\ell_1$-regularized compressed regression 
asymptotically predicts as well as an oracle linear model, a property called "persis- 
tence." Finally, we characterize the privacy properties of the compression procedure 
in information-theoretic terms, establishing upper bounds on the mutual information 
between the compressed and uncompressed data that decay to zero. 



Keywords: Sparsity, $\ell_1$ regularization, lasso, high dimensional regression, privacy, 
capacity of multi-antenna channels, compressed sensing. 

I. Introduction 



Two issues facing the use of statistical learning methods in applications are scale and privacy. 
Scale is an issue in storing, manipulating and analyzing extremely large, high dimensional data. 
Privacy is, increasingly, a concern whenever large amounts of confidential data are manipulated 
within an organization. It is often important to allow researchers to analyze data without compro- 
mising the privacy of customers or leaking confidential information outside the organization. In 
this paper we show that sparse regression for high dimensional data can be carried out directly on 
a compressed form of the data, in a manner that can be shown to guard privacy in an information 
theoretic sense. 

The approach we develop here compresses the data by a random linear or affine transformation, 
reducing the number of data records exponentially, while preserving the number of original input 
variables. These compressed data can then be made available for statistical analyses; we focus on 
the problem of sparse linear regression for high dimensional data. Informally, our theory ensures 
that the relevant predictors can be learned from the compressed data as well as they could be from 
the original uncompressed data. Moreover, the actual predictions based on new examples are as 
accurate as they would be had the original data been made available. However, the original data 
are not recoverable from the compressed data, and the compressed data effectively reveal no more 
information than would be revealed by a completely new sample. At the same time, the inference 
algorithms run faster and require fewer resources than the much larger uncompressed data would 
require. In fact, the original data need never be stored; they can be transformed "on the fly" as they 
come in. 

In more detail, the data are represented as an $n \times p$ matrix $X$. Each of the $p$ columns is an
attribute, and each of the $n$ rows is the vector of attributes for an individual record. The data
are compressed by a random linear transformation

$$X \mapsto \tilde{X} = \Phi X \tag{1.1}$$

where $\Phi$ is a random $m \times n$ matrix with $m \ll n$. It is also natural to consider a random affine
transformation

$$X \mapsto \tilde{X} = \Phi X + \Delta \tag{1.2}$$

where $\Delta$ is a random $m \times p$ matrix. Such transformations have been called "matrix masking" in the
privacy literature (Duncan and Pearson, 1991). The entries of $\Phi$ and $\Delta$ are taken to be independent
Gaussian random variables, but other distributions are possible. We think of $\tilde{X}$ as "public," while
$\Phi$ and $\Delta$ are private and only needed at the time of compression. However, even with $\Delta = 0$ and
$\Phi$ known, recovering $X$ from $\tilde{X}$ requires solving a highly under-determined linear system and comes
with information theoretic privacy guarantees, as we demonstrate.

In standard regression, a response $Y = X\beta + \epsilon \in \mathbb{R}^n$ is associated with the input variables, where
the $\epsilon_i$ are independent, mean zero additive noise variables. In compressed regression, we assume that
the response is also compressed, resulting in the transformed response $\tilde{Y} \in \mathbb{R}^m$ given by

$$Y \mapsto \tilde{Y} = \Phi Y \tag{1.3}$$
$$= \Phi X \beta + \Phi \epsilon \tag{1.4}$$
$$= \tilde{X} \beta + \tilde{\epsilon}. \tag{1.5}$$

Note that under compression, the transformed noise $\tilde{\epsilon} = \Phi \epsilon$ is not independent across examples.
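
As a minimal numerical sketch of the compression in (1.1) and (1.3)-(1.5), the following NumPy code compresses a synthetic data set. The helper name `compress`, the problem sizes, and the choice of variance 1/n for the entries of the projection matrix are illustrative assumptions; the text above only asks for independent Gaussian entries.

```python
import numpy as np

def compress(X, Y, m, rng):
    """Compress n records to m rows with a Gaussian matrix Phi.

    Assumption for illustration: entries of Phi are N(0, 1/n).
    """
    n = X.shape[0]
    Phi = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))
    return Phi @ X, Phi @ Y, Phi

rng = np.random.default_rng(0)
n, p, m, sigma = 200, 50, 40, 0.5

# Synthetic sparse linear model Y = X beta + eps.
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
eps = sigma * rng.normal(size=n)
Y = X @ beta + eps

X_tilde, Y_tilde, Phi = compress(X, Y, m, rng)

# Conditional on Phi, the compressed noise Phi @ eps has covariance
# sigma^2 * Phi @ Phi.T, which is generally not diagonal.
noise_cov = sigma**2 * (Phi @ Phi.T)
off_diag = noise_cov - np.diag(np.diag(noise_cov))
```

The nonzero off-diagonal entries of `noise_cov` are one way to see that the compressed noise is correlated across the m compressed examples.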

In the sparse setting, the parameter vector $\beta \in \mathbb{R}^p$ is sparse, with a relatively small number $s$ of
nonzero coefficients, $\operatorname{supp}(\beta) = \{j : \beta_j \neq 0\}$. Two key tasks are to identify the relevant variables,
and to predict the response $x^T \beta$ for a new input vector $x \in \mathbb{R}^p$. The method we focus on is
$\ell_1$-regularized least squares, also known as the lasso (Tibshirani, 1996). The main contributions
of this paper are two technical results on the performance of this estimator, and an
information-theoretic analysis of the privacy properties of the procedure. Our first result shows that the lasso is
sparsistent under compression, meaning that the correct sparse set of relevant variables is identified
asymptotically. Omitting details and technical assumptions for clarity, our result is the following.

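Sparsistence (Theorem 3.4): If the number of compressed examples $m$ satisfies

$$C_1 s^2 \log(nps) \le m \le \sqrt{\frac{C_2\, n}{\log n}}, \tag{1.6}$$

and the regularization parameter $\lambda_m$ satisfies

$$\lambda_m \to 0 \quad \text{and} \quad \frac{m \lambda_m^2}{\log p} \to \infty, \tag{1.7}$$

then the compressed lasso solution

$$\tilde{\beta}_m = \arg\min_{\beta} \frac{1}{2m} \|\tilde{Y} - \tilde{X}\beta\|_2^2 + \lambda_m \|\beta\|_1 \tag{1.8}$$

includes the correct variables, asymptotically:

$$P\left(\operatorname{supp}(\tilde{\beta}_m) = \operatorname{supp}(\beta)\right) \to 1. \tag{1.9}$$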

Our second result shows that the lasso is persistent under compression. Roughly speaking,
persistence (Greenshtein and Ritov, 2004) means that the procedure predicts well, as measured by the
predictive risk

$$R(\beta) = E(Y - X^T\beta)^2, \tag{1.10}$$

where now $X \in \mathbb{R}^p$ is a new input vector and $Y$ is the associated response. Persistence is a weaker
condition than sparsistency, and in particular does not assume that the true model is linear.

Persistence (Theorem 4.1): Given a sequence of sets of estimators $B_{n,m}$, the sequence of
compressed lasso estimators

$$\tilde{\beta}_{n,m} = \arg\min_{\|\beta\|_1 \le L_{n,m}} \|\tilde{Y} - \tilde{X}\beta\|_2^2 \tag{1.11}$$

is persistent with the oracle risk over uncompressed data with respect to $B_{n,m}$, meaning that

$$R(\tilde{\beta}_{n,m}) - \inf_{\|\beta\|_1 \le L_{n,m}} R(\beta) \xrightarrow{P} 0, \quad \text{as } n \to \infty, \tag{1.12}$$

in case $\log^2(np) \le m \le n$ and the radius of the $\ell_1$ ball satisfies $L_{n,m} = o\left((m/\log(np))^{1/4}\right)$.

Our third result analyzes the privacy properties of compressed regression. We consider the
problem of recovering the uncompressed data $X$ from the compressed data $\tilde{X} = \Phi X + \Delta$. To
preserve privacy, the random matrices $\Phi$ and $\Delta$ should remain private. However, even in the case
where $\Delta = 0$ and $\Phi$ is known, if $m \ll \min(n, p)$ the linear system $\tilde{X} = \Phi X$ is highly
under-determined. We evaluate privacy in information theoretic terms by bounding the average mutual
information $I(\tilde{X}; X)/np$ per matrix entry in the original data matrix $X$, which can be viewed as a
communication rate. Bounding this mutual information is intimately connected with the problem
of computing the channel capacity of certain multiple-antenna wireless communication systems
(Marzetta and Hochwald, 1999; Telatar, 1999).

Information Resistance (Propositions 5.1 and 5.2): The rate at which information about $X$ is
revealed by the compressed data $\tilde{X}$ satisfies

$$\sup \frac{I(X; \tilde{X})}{np} = O\left(\frac{m}{n}\right) \to 0, \tag{1.13}$$

where the supremum is over distributions on the original data $X$.



As summarized by these results, compressed regression is a practical procedure for sparse learning 
in high dimensional data that has provably good properties. This basic technique has connections in 
the privacy literature with matrix masking and other methods, yet most of the existing work in this 
direction has been heuristic and without theoretical guarantees; connections with this literature are 
briefly reviewed in Section 2.C. Compressed regression builds on the ideas underlying compressed
sensing and sparse inference in high dimensional data, topics which have attracted a great deal
of recent interest in the statistics and signal processing communities; the connections with this
literature are reviewed in Sections 2.B and 2.A.

The remainder of the paper is organized as follows. In Section 2 we review relevant work from
high dimensional statistical inference, compressed sensing and privacy. Section 3 presents our
analysis of the sparsistency properties of the compressed lasso. Our approach follows the methods
introduced by Wainwright (2006) in the uncompressed case. Section 4 proves that compressed
regression is persistent. Section 5 derives upper bounds on the mutual information between the
compressed data $\tilde{X}$ and the uncompressed data $X$, after identifying a correspondence with the
problem of computing channel capacity for a certain model of a multiple-antenna mobile
communication channel. Section 6 includes the results of experimental simulations, showing that the
empirical performance of the compressed lasso is consistent with our theoretical analysis. We
evaluate the ability of the procedure to recover the relevant variables (sparsistency) and to predict
well (persistence). The technical details of the proof of sparsistency are collected at the end of
the paper, in Section 7. The paper concludes with a discussion of the results and directions for
future work in Section 8.



II. Background and Related Work 

In this section we briefly review relevant related work in high dimensional statistical inference, 
compressed sensing, and privacy, to place our work in context. 

A. Sparse Regression 

We adopt standard notation where a data matrix $X$ has $p$ variables and $n$ records; in a linear model
the response $Y = X\beta + \epsilon \in \mathbb{R}^n$ is thus an $n$-vector, and the noise is independent and mean zero,
$E(\epsilon) = 0$. The usual estimator of $\beta$ is the least squares estimator

$$\hat{\beta} = (X^T X)^{-1} X^T Y. \tag{2.1}$$

However, this estimator has very large variance when $p$ is large, and is not even defined when
$p > n$. An estimator that has received much attention in the recent literature is the lasso $\hat{\beta}_n$
(Tibshirani, 1996), defined as

$$\hat{\beta}_n = \arg\min_{\beta} \frac{1}{2n} \sum_{i=1}^{n} (Y_i - X_i^T \beta)^2 + \lambda_n \sum_{j=1}^{p} |\beta_j| \tag{2.2}$$
$$= \arg\min_{\beta} \frac{1}{2n} \|Y - X\beta\|_2^2 + \lambda_n \|\beta\|_1, \tag{2.3}$$

where $\lambda_n$ is a regularization parameter. The practical success and importance of the lasso can be
attributed to the fact that in many cases $\beta$ is sparse, that is, it has few large components. For
example, data are often collected with many variables in the hope that at least a few will be useful 
for prediction. The result is that many covariates contribute little to the prediction of Y, although 
it is not known in advance which variables are important. Recent work has greatly clarified the 
properties of the lasso estimator in the high dimensional setting. 
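
To make the objective in (2.3) concrete, here is a small sketch that minimizes it by proximal gradient descent (ISTA). This is not the solver used in the paper, which does not specify one; the function names, iteration count, and synthetic data are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=2000):
    """Minimize (1/2n)||Y - X b||_2^2 + lam * ||b||_1 by proximal gradient."""
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / n is the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (Y - X @ beta) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Synthetic example: two relevant variables out of twenty.
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]
Y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = lasso_ista(X, Y, lam=0.2)
recovered_support = set(np.flatnonzero(beta_hat).tolist())
```

The same routine applies unchanged to compressed data: pass the compressed covariates and response, and the factor 1/2n becomes 1/2m because it is taken from the number of rows.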

One of the most basic desirable properties of an estimator is consistency; an estimator $\hat{\beta}_n$ is
consistent in case

$$\|\hat{\beta}_n - \beta\|_2 \xrightarrow{P} 0. \tag{2.4}$$

Meinshausen and Yu (2006) have recently shown that the lasso is consistent in the high
dimensional setting. If the underlying model is sparse, a natural yet more demanding criterion is to ask
that the estimator correctly identify the relevant variables. This may be useful for interpretation,
dimension reduction and prediction. For example, if an effective procedure for high-dimensional
data can be used to identify the relevant variables in the model, then these variables can be isolated
and their coefficients estimated by a separate procedure that works well for low-dimensional data.
An estimator is sparsistent^1 if

$$P\left(\operatorname{supp}(\hat{\beta}_n) = \operatorname{supp}(\beta)\right) \to 1, \tag{2.5}$$

where $\operatorname{supp}(\beta) = \{j : \beta_j \neq 0\}$. Asymptotically, a sparsistent estimator has nonzero
coefficients only for the true relevant variables. Sparsistency proofs for high dimensional problems
have appeared recently in a number of settings. Meinshausen and Bühlmann (2006) consider the
problem of estimating the graph underlying a sparse Gaussian graphical model by showing
sparsistency of the lasso with exponential rates of convergence on the probability of error. Zhao and Yu
(2007) show sparsistency of the lasso under more general noise distributions. Wainwright (2006)
characterizes the sparsistency properties of the lasso by showing that there is a threshold sample
size $n(p, s)$ above which the relevant variables are identified, and below which the relevant
variables fail to be identified, where $s = \|\beta\|_0$ is the number of relevant variables. More precisely,
Wainwright (2006) shows that when $X$ comes from a Gaussian ensemble, there exist fixed
constants $0 < \theta_\ell \le 1$ and $1 \le \theta_u < +\infty$, where $\theta_\ell = \theta_u = 1$ when each row of $X$ is chosen as an
independent Gaussian random vector $\sim N(0, I_{p \times p})$, then for any $\nu > 0$, if

$$n > 2(\theta_u + \nu)\, s \log(p - s) + s + 1, \tag{2.6}$$

then the lasso identifies the true variables with probability approaching one. Conversely, if

$$n < 2(\theta_\ell - \nu)\, s \log(p - s) + s + 1, \tag{2.7}$$

then the probability of recovering the true variables using the lasso approaches zero. These results 
require certain incoherence assumptions on the data X; intuitively, it is required that an irrele- 
vant variable cannot be too strongly correlated with the set of relevant variables. This result and 
Wainwright's method of analysis are particularly relevant to the current paper; the details will be 
described in the following section. In particular, we refer to this result as the Gaussian Ensemble 
result. However, it is important to point out that under compression, the noise $\tilde{\epsilon} = \Phi\epsilon$ is not
independent. This prevents one from simply applying the Gaussian Ensemble results to the
compressed case. Related work that studies information theoretic limits of sparsity recovery, where
the particular estimator is not specified, includes (Wainwright, 2007; Donoho and Tanner, 2006).
Sparsistency in the classification setting, with exponential rates of convergence for $\ell_1$-regularized
logistic regression, is studied by Wainwright et al. (2007).

^1 This terminology is due to Pradeep Ravikumar.



An alternative goal is accurate prediction. In high dimensions it is essential to regularize the model 
in some fashion in order to control the variance of the estimator and attain good predictive risk.
Persistence for the lasso was first defined and studied by Greenshtein and Ritov (2004). Given a
sequence of sets of estimators $B_n$, the sequence of estimators $\hat{\beta}_n \in B_n$ is called persistent in case

$$R(\hat{\beta}_n) - \inf_{\beta \in B_n} R(\beta) \xrightarrow{P} 0, \tag{2.8}$$

where $R(\beta) = E(Y - X^T\beta)^2$ is the prediction risk of a new pair $(X, Y)$. Thus, a sequence of
estimators is persistent if it asymptotically predicts as well as the oracle within the class, which 
minimizes the population risk; it can be achieved under weaker assumptions than are required for 
sparsistence. In particular, persistence does not assume the true model is linear, and it does not 
require strong incoherence assumptions on the data. The results of the current paper show that 
sparsistence and persistence are preserved under compression. 



B. Compressed Sensing 

Compressed regression has close connections to, and draws motivation from, compressed sensing
(Donoho, 2006; Candès et al., 2006; Candès and Tao, 2006; Rauhut et al., 2007). However, in a
sense, our motivation here is the opposite to that of compressed sensing. While compressed sensing
of $X$ allows a sparse $X$ to be reconstructed from a small number of random measurements, our goal
is to reconstruct a sparse function of $X$. Indeed, from the point of view of privacy, approximately
reconstructing $X$, which compressed sensing shows is possible if $X$ is sparse, should be viewed as
undesirable; we return to this point in Section 5.

Several authors have considered variations on compressed sensing for statistical signal processing
tasks (Duarte et al., 2006; Davenport et al., 2006; Haupt et al., 2006; Davenport et al., 2007). The
focus of this work is to consider certain hypothesis testing problems under sparse random
measurements, and a generalization to classification of a signal into two or more classes. Here one
observes $y = \Phi x$, where $y \in \mathbb{R}^m$, $x \in \mathbb{R}^n$ and $\Phi$ is a known random measurement matrix. The
problem is to select between the hypotheses

$$H_i : y = \Phi(s_i + \epsilon), \tag{2.9}$$

where $\epsilon \in \mathbb{R}^n$ is additive Gaussian noise. Importantly, the setup exploits the "universality" of the
matrix $\Phi$, which is not selected with knowledge of $s_i$. The proof techniques use concentration
properties of random projection, which underlie the celebrated lemma of Johnson and Lindenstrauss
(1984). The compressed regression problem we introduce can be considered as a more
challenging statistical inference task, where the problem is to select from an exponentially large set of linear
models, each with a certain set of relevant variables with unknown parameters, or to predict as well
as the best linear model in some class. Moreover, a key motivation for compressed regression is
privacy; if privacy is not a concern, simple subsampling of the data matrix could be an effective
compression procedure.



C. Privacy 



Research on privacy in statistical data analysis has a long history, going back at least to Dalenius
(1977a); we refer to Duncan and Pearson (1991) for discussion and further pointers into this
literature. The compression method we employ has been called matrix masking in the privacy
literature. In the general method, the $n \times p$ data matrix $X$ is transformed by pre-multiplication,
post-multiplication, and addition into a new $m \times q$ matrix

$$\tilde{X} = AXB + C. \tag{2.10}$$

The transformation $A$ operates on data records for fixed covariates, and the transformation $B$
operates on covariates for a fixed record. The method encapsulated in this transformation is quite
general, and allows the possibility of deleting records, suppressing subsets of variables, data
swapping, and including simulated data. In our use of matrix masking, we transform the data by
replacing each variable with a relatively small number of random averages of the instances of that
variable in the data. In other work, Sanil et al. (2004) consider the problem of privacy preserving
regression analysis in distributed data, where different variables appear in different databases but
it is of interest to integrate data across databases. The recent work of Ting et al. (2007) considers
random orthogonal mappings $X \mapsto RX = \tilde{X}$ where $R$ is a random rotation (rank $n$), designed to
preserve the sufficient statistics of a multivariate Gaussian and therefore allow regression
estimation, for instance. This use of matrix masking does not share the information theoretic guarantees
we present in Section 5. We are not aware of previous work that analyzes the asymptotic properties
of a statistical estimator under matrix masking in the high dimensional setting.



The work of Liu et al. (2006) is closely related to the current paper at a high level, in that it
considers low rank random linear transformations of either the row space or column space of the data $X$.
Liu et al. (2006) note the Johnson-Lindenstrauss lemma, which implies that $\ell_2$ norms are
approximately preserved under random projection, and argue heuristically that data mining procedures
that exploit correlations or pairwise distances in the data, such as principal components analysis
and clustering, are just as effective under random projection. The privacy analysis is restricted
to observing that recovering $X$ from $\tilde{X}$ requires solving an under-determined linear system, and
arguing that this prevents the exact values from being recovered.



An information-theoretic quantification of privacy was formulated by Agrawal and Aggarwal (2001).
Given a random variable $X$ and a transformed variable $\tilde{X}$, Agrawal and Aggarwal (2001) define
the conditional privacy loss of $X$ given $\tilde{X}$ as

$$\mathcal{P}(X \mid \tilde{X}) = 1 - 2^{-I(X; \tilde{X})}, \tag{2.11}$$

which is simply a transformed measure of the mutual information between the two random
variables. In our work we identify privacy with the rate of information communicated about $X$ through
$\tilde{X}$ under matrix masking, maximizing over all distributions on $X$. We furthermore identify this
with the problem of computing, or bounding, the Shannon capacity of a multi-antenna wireless
communication channel, as modeled by Telatar (1999) and Marzetta and Hochwald (1999).
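
The transformation in (2.11) is easy to evaluate directly. The sketch below assumes the mutual information is measured in bits, which matches the base-2 exponent; the function name is our own.

```python
def conditional_privacy_loss(mutual_information_bits):
    """Privacy loss 1 - 2^(-I), with the mutual information I given in bits."""
    if mutual_information_bits < 0:
        raise ValueError("mutual information is nonnegative")
    return 1.0 - 2.0 ** (-mutual_information_bits)

# No shared information gives zero loss; one bit gives a loss of one half.
loss_none = conditional_privacy_loss(0.0)
loss_one_bit = conditional_privacy_loss(1.0)
loss_small = conditional_privacy_loss(0.01)
```

The loss increases monotonically from 0 toward 1 as the mutual information grows, so a bound on the mutual information that decays to zero also drives this loss to zero.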



Finally, it is important to mention the extensive and currently active line of work on cryptographic 
approaches to privacy, which have come mainly from the theoretical computer science community.
For instance, Feigenbaum et al. (2006) develop a framework for secure computation of
approximations; intuitively, a private approximation of a function $f$ is an approximation $\hat{f}$ that does
not reveal information about $x$ other than what can be deduced from $f(x)$. Indyk and Woodruff
(2006) consider the problem of computing private approximate nearest neighbors in this setting.
Dwork (2006) revisits the notion of privacy formulated by Dalenius (1977b), which intuitively
demands that nothing can be learned about an individual record in a database that cannot be learned
without access to the database. An impossibility result is given which shows that, appropriately
formalized, this strong notion of privacy cannot be achieved. An alternative notion of differential
privacy is proposed, which allows the probability of a disclosure of private information to change
by only a small multiplicative factor, depending on whether or not an individual participates in the
database. This line of work has recently been built upon by Dwork et al. (2007), with connections
to compressed sensing, showing that any method that gives accurate answers to a large fraction of
randomly generated subset sum queries must violate privacy.



III. Compressed Regression is Sparsistent 

In the standard setting, $X$ is an $n \times p$ matrix, $Y = X\beta + \epsilon$ is a vector of noisy observations under a
linear model, and $p$ is considered to be a constant. In the high-dimensional setting we allow $p$ to
grow with $n$. The lasso refers to the following quadratic program:

$$(P_1) \quad \text{minimize } \|Y - X\beta\|_2^2 \text{ such that } \|\beta\|_1 \le L. \tag{3.1}$$

In Lagrangian form, this becomes the optimization problem

$$(P_2) \quad \text{minimize } \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n \|\beta\|_1, \tag{3.2}$$

where the scaling factor $1/2n$ is chosen by convention and convenience. For an appropriate choice
of the regularization parameter $\lambda = \lambda(Y, L)$, the solutions of these two problems coincide.

In compressed regression we project each column $X_j \in \mathbb{R}^n$ of $X$ to a subspace of $m$ dimensions,
using an $m \times n$ random projection matrix $\Phi$. We shall assume that the entries of $\Phi$ are independent
Gaussian random variables:

$$\Phi_{ij} \sim N(0, 1/n). \tag{3.3}$$

Let $\tilde{X} = \Phi X$ be the compressed matrix of covariates, and let $\tilde{Y} = \Phi Y$ be the compressed response.
Our objective is to estimate $\beta$ in order to determine the relevant variables, or to predict well. The
compressed lasso is the optimization problem, for $\tilde{Y} = \Phi X\beta + \Phi\epsilon = \tilde{X}\beta + \tilde{\epsilon}$:

$$(\tilde{P}_2) \quad \text{minimize } \frac{1}{2m}\|\tilde{Y} - \tilde{X}\beta\|_2^2 + \lambda_m \|\beta\|_1, \tag{3.4}$$

with $\tilde{\Omega}_m$ being the set of optimal solutions:

$$\tilde{\Omega}_m = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2m}\|\tilde{Y} - \tilde{X}\beta\|_2^2 + \lambda_m \|\beta\|_1. \tag{3.5}$$

Thus, the transformed noise $\tilde{\epsilon}$ is no longer i.i.d., a fact that complicates the analysis. It is convenient
to formalize the model selection problem using the following definitions.

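Definition 3.1. (Sign Consistency) A set of estimators $\Omega_n$ is sign consistent with the true $\beta$ if

$$P\left(\exists\, \hat{\beta}_n \in \Omega_n \text{ s.t. } \operatorname{sgn}(\hat{\beta}_n) = \operatorname{sgn}(\beta)\right) \to 1 \text{ as } n \to \infty, \tag{3.6}$$

where $\operatorname{sgn}(\cdot)$ is given by

$$\operatorname{sgn}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x = 0 \\ -1 & \text{if } x < 0. \end{cases} \tag{3.7}$$

As a shorthand, we use

$$\mathcal{E}\left(\operatorname{sgn}(\hat{\beta}_n) = \operatorname{sgn}(\beta^*)\right) := \left\{\exists\, \hat{\beta} \in \Omega_n \text{ such that } \operatorname{sgn}(\hat{\beta}) = \operatorname{sgn}(\beta^*)\right\} \tag{3.8}$$

to denote the event that a sign consistent solution exists.

The lasso objective function is convex in $\beta$, and strictly convex for $p \le n$. Therefore the set of
solutions to the lasso and compressed lasso (3.4) is convex: if $\hat{\beta}$ and $\hat{\beta}'$ are two solutions, then by
convexity $\hat{\beta} + \rho(\hat{\beta}' - \hat{\beta})$ is also a solution for any $\rho \in [0, 1]$.

Definition 3.2. (Sparsistency) A set of estimators $\Omega_n$ is sparsistent with the true $\beta$ if

$$P\left(\exists\, \hat{\beta}_n \in \Omega_n \text{ s.t. } \operatorname{supp}(\hat{\beta}_n) = \operatorname{supp}(\beta)\right) \to 1 \text{ as } n \to \infty. \tag{3.9}$$

Clearly, if a set of estimators is sign consistent then it is sparsistent. Although sparsistency is the
primary goal in selecting the correct variables, our analysis establishes conditions for the slightly
stronger property of sign consistency.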
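
The difference between the two definitions can be seen on a single estimate. The helpers below are a plain-Python illustration with names of our choosing; they check one vector rather than the probabilistic statements in the definitions.

```python
def sgn(x):
    """Sign function taking values in {-1, 0, 1}."""
    return (x > 0) - (x < 0)

def support(beta):
    """Indices of the nonzero coefficients."""
    return {j for j, b in enumerate(beta) if b != 0}

def same_signs(beta_hat, beta_true):
    return [sgn(a) for a in beta_hat] == [sgn(b) for b in beta_true]

def same_support(beta_hat, beta_true):
    return support(beta_hat) == support(beta_true)

beta_true = [2.0, -1.0, 0.0]
good_estimate = [1.7, -0.4, 0.0]   # correct support and correct signs
flipped_estimate = [1.7, 0.3, 0.0] # correct support, wrong sign on entry 1
```

Here `flipped_estimate` recovers the support but not the signs, which is why sign consistency is the stronger of the two properties.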

All recent work establishing results on sparsity recovery assumes some form of incoherence condi- 
tion on the data matrix X. Such a condition ensures that the irrelevant variables are not too strongly 
correlated with the relevant variables. Intuitively, without such a condition the lasso may be
subject to false positives and negatives, where a relevant variable is replaced by a highly correlated
irrelevant variable. To formulate such a condition, it is convenient to introduce an additional piece
of notation. Let $S = \{j : \beta_j \neq 0\}$ be the set of relevant variables and let $S^c = \{1, \ldots, p\} \setminus S$
be the set of irrelevant variables. Then $X_S$ and $X_{S^c}$ denote the corresponding sets of columns of
the matrix $X$. Related incoherence conditions appear in Donoho et al. (2006) and Tropp (2004).

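Definition 3.3. ($S$-Incoherence) Let $X$ be an $n \times p$ matrix and let $S \subset \{1, \ldots, p\}$ be nonempty.
We say that $X$ is $S$-incoherent in case

$$\left\|\tfrac{1}{n} X_{S^c}^T X_S\right\|_\infty + \left\|\tfrac{1}{n} X_S^T X_S - I_{|S|}\right\|_\infty \le 1 - \eta, \quad \text{for some } \eta \in (0, 1], \tag{3.10}$$

where $\|A\|_\infty = \max_i \sum_{j} |A_{ij}|$ denotes the matrix $\infty$-norm.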
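
The left-hand side of (3.10) can be computed directly for a given design. The sketch below returns the largest admissible $\eta$, so that $X$ is $S$-incoherent exactly when the returned value is positive. The function names and the two test designs (an orthogonal design, and the same design with one column duplicated) are assumptions made for illustration.

```python
import numpy as np

def matrix_inf_norm(A):
    """Matrix infinity-norm: the maximum absolute row sum."""
    return np.abs(A).sum(axis=1).max()

def incoherence_slack(X, S):
    """Return 1 minus the left-hand side of (3.10) for the index set S."""
    n, p = X.shape
    S = np.asarray(sorted(S))
    Sc = np.setdiff1d(np.arange(p), S)
    XS, XSc = X[:, S], X[:, Sc]
    lhs = (matrix_inf_norm(XSc.T @ XS / n)
           + matrix_inf_norm(XS.T @ XS / n - np.eye(len(S))))
    return 1.0 - lhs

rng = np.random.default_rng(0)
n, p = 50, 6
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = np.sqrt(n) * Q                 # columns are orthogonal with squared norm n

eta_good = incoherence_slack(X, [0, 1])

X_bad = X.copy()
X_bad[:, 3] = X_bad[:, 0]          # an irrelevant column copies a relevant one
eta_bad = incoherence_slack(X_bad, [0, 1])
```

For the orthogonal design the slack is essentially 1, the best possible value; duplicating a relevant column among the irrelevant ones drives it to 0, so the condition fails.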

Although it is not explicitly required, we only apply this definition to $X$ such that columns of $X$
satisfy $\|X_j\|_2^2 = \Theta(n)$, $\forall j \in \{1, \ldots, p\}$. We can now state the main result of this section.

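Theorem 3.4. Suppose that, before compression, we have $Y = X\beta^* + \epsilon$, where each column of
$X$ is normalized to have squared $\ell_2$-norm $n$, and $\epsilon \sim N(0, \sigma^2 I_n)$. Assume that $X$ is $S$-incoherent, where
$S = \operatorname{supp}(\beta^*)$, and define $s = |S|$ and $\rho_m = \min_{i \in S} |\beta_i^*|$. We observe, after compression,

$$\tilde{Y} = \tilde{X}\beta^* + \tilde{\epsilon},$$

where $\tilde{Y} = \Phi Y$, $\tilde{X} = \Phi X$, and $\tilde{\epsilon} = \Phi\epsilon$, where $\Phi_{ij} \sim N(0, 1/n)$. Suppose

$$\left(\frac{16 C_1 s^2}{\eta^2} + \frac{4 C_2 s}{\eta}\right)\left(\ln p + 2\log n + \log 2(s+1)\right) \le m \le \sqrt{\frac{n}{16 \log n}} \tag{3.11}$$

with $C_1 = \frac{4e}{\sqrt{6\pi}} \approx 2.5044$ and $C_2 = \sqrt{8}\,e \approx 7.6885$, and $\lambda_m$ satisfies

$$\text{(a)} \quad \frac{m \eta^2 \lambda_m^2}{\log(p - s)} \to \infty, \quad \text{and} \tag{3.12}$$
$$\text{(b)} \quad \frac{1}{\rho_m}\left\{\sqrt{\frac{\log s}{m}} + \lambda_m \left\|\left(\tfrac{1}{n} X_S^T X_S\right)^{-1}\right\|_\infty\right\} \to 0. \tag{3.13}$$

Then the compressed lasso is sparsistent:

$$P\left(\operatorname{supp}(\tilde{\beta}_m) = \operatorname{supp}(\beta)\right) \to 1 \text{ as } m \to \infty, \tag{3.14}$$

where $\tilde{\beta}_m$ is an optimal solution to (3.4).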
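
To get a feel for the range of $m$ allowed by (3.11), the bounds can be evaluated numerically. The sketch below assumes every logarithm in (3.11) is the natural logarithm; the function names and the example values of $n$, $p$, $s$, $\eta$ are ours.

```python
import math

C1 = 4 * math.e / math.sqrt(6 * math.pi)   # approximately 2.5044
C2 = math.sqrt(8) * math.e                 # approximately 7.6885

def m_lower(n, p, s, eta):
    """Lower bound on m in (3.11), with all logarithms taken as natural logs."""
    scale = 16 * C1 * s**2 / eta**2 + 4 * C2 * s / eta
    return scale * (math.log(p) + 2 * math.log(n) + math.log(2 * (s + 1)))

def m_upper(n):
    """Upper bound on m in (3.11)."""
    return math.sqrt(n / (16 * math.log(n)))

# One relevant variable, perfectly incoherent design (eta = 1), p = 1000.
window_large_n = (m_lower(10**12, 1000, 1, 1.0), m_upper(10**12))
window_moderate_n = (m_lower(10**6, 1000, 1, 1.0), m_upper(10**6))
```

With these explicit constants the window is nonempty for the first setting but empty for the second, which suggests reading (3.11) as an asymptotic statement with conservative constants rather than as practical guidance on choosing $m$.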
A. Outline of Proof for Theorem 3.4



Our overall approach is to follow a deterministic analysis, in the sense that we analyze $\Phi X$ as a
realization from the distribution of $\Phi$ from a Gaussian ensemble. Assuming that $X$ satisfies the
$S$-incoherence condition, we show that with high probability $\Phi X$ also satisfies the $S$-incoherence
condition, and hence the incoherence conditions (7.1a) and (7.1b) used by Wainwright (2006). In
addition, we make use of a large deviation result that shows $\Phi\Phi^T$ is concentrated around its mean
$I_{m \times m}$, which is crucial for the recovery of the true sparsity pattern. It is important to note that the
compressed noise $\tilde{\epsilon}$ is not independent and identically distributed, even when conditioned on $\Phi$.

In more detail, we first show that with high probability $1 - n^{-c}$ for some $c \ge 2$, the projected data
$\Phi X$ satisfies the following properties:

1. Each column of $\tilde{X} = \Phi X$ has squared $\ell_2$-norm at most $m(1 + \eta/4s)$;

2. $\tilde{X}$ is $S$-incoherent, and also satisfies the incoherence conditions (7.1a) and (7.1b).

In addition, the projections satisfy the following properties:

1. Each entry of $\Phi\Phi^T - I$ is at most $\sqrt{b \log n / n}$ for some constant $b$, with high probability;

2. $P\left(\left|\tfrac{n}{m}\langle \Phi x, \Phi y\rangle - \langle x, y\rangle\right| \ge \tau\right) \le 2\exp\left(-\frac{m\tau^2}{C_1 + C_2\tau}\right)$ for any $x, y \in \mathbb{R}^n$ with $\|x\|_2, \|y\|_2 \le 1$.

These facts allow us to condition on a "good" $\Phi$ and incoherent $\Phi X$, and to proceed as in the
deterministic setting with Gaussian noise. Our analysis then follows that of Wainwright (2006).
Recall $S$ is the set of relevant variables in $\beta$ and $S^c = \{1, \ldots, p\} \setminus S$ is the set of irrelevant
variables. To explain the basic approach, first observe that the KKT conditions imply that $\tilde{\beta} \in \mathbb{R}^p$
is an optimal solution to (3.4), i.e., $\tilde{\beta} \in \tilde{\Omega}_m$, if and only if there exists a subgradient

$$\tilde{z} \in \partial\|\tilde{\beta}\|_1 = \left\{z \in \mathbb{R}^p \mid z_i = \operatorname{sgn}(\tilde{\beta}_i) \text{ for } \tilde{\beta}_i \neq 0, \text{ and } |z_j| \le 1 \text{ otherwise}\right\} \tag{3.15}$$

such that

$$\frac{1}{m}\tilde{X}^T\tilde{X}\tilde{\beta} - \frac{1}{m}\tilde{X}^T\tilde{Y} + \lambda_m \tilde{z} = 0. \tag{3.16}$$

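Hence, the event $\mathcal{E}(\operatorname{sgn}(\tilde{\beta}_m) = \operatorname{sgn}(\beta^*))$ can be shown to be equivalent to requiring the existence of a
solution $\tilde{\beta} \in \mathbb{R}^p$ such that $\operatorname{sgn}(\tilde{\beta}) = \operatorname{sgn}(\beta^*)$, and a subgradient $\tilde{z} \in \partial\|\tilde{\beta}\|_1$, such that the following
equations hold:

$$\frac{1}{m}\tilde{X}_{S^c}^T\tilde{X}_S(\tilde{\beta}_S - \beta_S^*) - \frac{1}{m}\tilde{X}_{S^c}^T\tilde{\epsilon} = -\lambda_m \tilde{z}_{S^c}, \tag{3.17a}$$
$$\frac{1}{m}\tilde{X}_S^T\tilde{X}_S(\tilde{\beta}_S - \beta_S^*) - \frac{1}{m}\tilde{X}_S^T\tilde{\epsilon} = -\lambda_m \tilde{z}_S = -\lambda_m \operatorname{sgn}(\beta_S^*), \tag{3.17b}$$

where $\tilde{z}_S = \operatorname{sgn}(\beta_S^*)$ and $|\tilde{z}_{S^c}| \le 1$ by definition of $\tilde{z}$. The existence of solutions to equations
(3.17a) and (3.17b) can be characterized in terms of two events $\mathcal{E}(V)$ and $\mathcal{E}(U)$. The proof
proceeds by showing that $P(\mathcal{E}(V)) \to 1$ and $P(\mathcal{E}(U)) \to 1$ as $m \to \infty$.

In the remainder of this section we present the main steps of the proof, relegating the technical
details to Section 7.B. To avoid unnecessary clutter in notation, we will use $Z$ to denote the
compressed data $\tilde{X} = \Phi X$ and $W$ to denote the compressed response $\tilde{Y} = \Phi Y$, and $\omega = \tilde{\epsilon}$ to
denote the compressed noise.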
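
The KKT characterization in (3.15)-(3.16) gives a direct numerical test of whether a candidate vector solves the compressed lasso. The following sketch measures how far a candidate is from satisfying it; the function name, the tolerance, and the orthogonal test design (for which the lasso solution is known to be a soft-thresholded correlation vector) are illustrative assumptions.

```python
import numpy as np

def kkt_violation(Z, W, beta, lam, tol=1e-8):
    """Largest violation of the subgradient condition for the lasso on (Z, W)."""
    m = Z.shape[0]
    # Stationarity requires g = lam * z for some subgradient z of ||beta||_1.
    g = Z.T @ (W - Z @ beta) / m
    active = np.abs(beta) > tol
    viol_active = np.abs(g[active] - lam * np.sign(beta[active]))
    viol_inactive = np.maximum(np.abs(g[~active]) - lam, 0.0)
    return max(viol_active.max(initial=0.0), viol_inactive.max(initial=0.0))

rng = np.random.default_rng(0)
m, p, lam = 30, 5, 0.1
Q, _ = np.linalg.qr(rng.normal(size=(m, p)))
Z = np.sqrt(m) * Q                       # Z.T @ Z / m is the identity
W = Z @ np.array([1.0, -0.5, 0.0, 0.0, 0.0]) + 0.01 * rng.normal(size=m)

# With an orthogonal design the minimizer is a soft-thresholded correlation.
c = Z.T @ W / m
beta_opt = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

violation_at_optimum = kkt_violation(Z, W, beta_opt, lam)
violation_at_zero = kkt_violation(Z, W, np.zeros(p), lam)
```

The violation vanishes (up to rounding) at the minimizer and is clearly positive at the zero vector, which is not optimal for this value of the regularization parameter.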

B. Incoherence and Concentration Under Random Projection 

In order for the estimated $\tilde{\beta}_m$ to be close to the solution of the uncompressed lasso, we require the
stability of inner products of columns of $X$ under multiplication with the random matrix $\Phi$, in the
sense that

$$\langle \Phi X_i, \Phi X_j\rangle \approx \langle X_i, X_j\rangle. \tag{3.18}$$

Toward this end we have the following result, adapted from Rauhut et al. (2007), where for each
entry in $\Phi$, the variance is $\frac{1}{m}$ instead of $\frac{1}{n}$.



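Lemma 3.5. (Adapted from Rauhut et al. (2007)) Let $x, y \in \mathbb{R}^n$ with $\|x\|_2, \|y\|_2 \le 1$. Assume
that $\Phi$ is an $m \times n$ random matrix with independent $N(0, n^{-1})$ entries (independent of $x, y$). Then
for all $\tau > 0$

$$P\left(\left|\frac{n}{m}\langle \Phi x, \Phi y\rangle - \langle x, y\rangle\right| \ge \tau\right) \le 2\exp\left(\frac{-m\tau^2}{C_1 + C_2\tau}\right) \tag{3.19}$$

with $C_1 = \frac{4e}{\sqrt{6\pi}} \approx 2.5044$ and $C_2 = \sqrt{8}\,e \approx 7.6885$.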
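
A quick Monte Carlo experiment can be set against the tail bound (3.19). The sketch below is only an illustration on one pair of vectors with arbitrarily chosen sizes and trial count; the function names are ours, and a single simulation of this kind does not verify the lemma.

```python
import numpy as np

C1 = 4 * np.e / np.sqrt(6 * np.pi)
C2 = np.sqrt(8) * np.e

def tail_bound(m, tau):
    """Right-hand side of (3.19)."""
    return 2 * np.exp(-m * tau**2 / (C1 + C2 * tau))

def empirical_tail(x, y, m, tau, trials, rng):
    """Fraction of draws of Phi with |(n/m)<Phi x, Phi y> - <x, y>| >= tau."""
    n = x.size
    hits = 0
    for _ in range(trials):
        Phi = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))
        deviation = abs((n / m) * (Phi @ x) @ (Phi @ y) - x @ y)
        hits += int(deviation >= tau)
    return hits / trials

rng = np.random.default_rng(0)
n, m, tau = 100, 50, 0.5
x = rng.normal(size=n)
x /= np.linalg.norm(x)                   # unit vector, so ||x||_2 <= 1 holds

observed = empirical_tail(x, x, m, tau, trials=300, rng=rng)
bound = tail_bound(m, tau)
```

In this setting the observed frequency falls well below the bound, consistent with the bound being valid but not tight for these sizes.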



We next summarize the properties of OX that we require. The following result implies that, with 
high probability, incoherence is preserved under random projection. 



Proposition 3.6. Let $X$ be a (deterministic) design matrix that is $S$-incoherent with columns of squared
$\ell_2$-norm $n$, and let $\Phi$ be an $m \times n$ random matrix with independent $N(0, n^{-1})$ entries. Suppose that

$$m \ge \left(\frac{16 C_1 s^2}{\eta^2} + \frac{4 C_2 s}{\eta}\right)\left(\ln p + c \ln n + \ln 2(s+1)\right) \tag{3.20}$$

for some $c \ge 2$, where $C_1, C_2$ are defined in Lemma 3.5. Then with probability at least $1 - 1/n^c$
the following properties hold for $Z = \Phi X$:
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1. Z is S -incoherent; in particular: 



I — Z re Z C I 

m o J I 



+ \\m Z S Z S ~ !s\ 



< 



4' 



< 1 - -. 



(3.21a) 
(3.21b) 



2. $Z = \Phi X$ is incoherent in the sense of (7.1a) and (7.1b):

$$\left\|Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1}\right\|_\infty \le 1 - \eta/2, \tag{3.22a}$$

$$\Lambda_{\min}\left(\frac{1}{m} Z_S^T Z_S\right) \ge \frac{3\eta}{4}. \tag{3.22b}$$



3. The $\ell_2$ norm of each column is approximately preserved, for all $j$:

$$\left|\,\|\Phi X_j\|_2^2 - m\,\right| \le \frac{m\eta}{4s}. \tag{3.23}$$



Finally, we have the following large deviation result for the projection matrix $\Phi$, which guarantees
that $R = \Phi\Phi^T - I_{m\times m}$ is small entrywise.

Theorem 3.7. If $\Phi$ is an $m \times n$ random matrix with independent entries $\Phi_{ij} \sim N(0, \frac{1}{n})$, then
$R = \Phi\Phi^T - I$ satisfies

\ m 2 

< (3.24) 



P 



( 



max I Rn I > -yi61ogn/n 



U 



m<ix\Rij\ 



> 



y/2 logn/n 



C. Proof of Theorem 3.4



We first state necessary and sufficient conditions on the event $\mathcal{E}(\mathrm{sgn}(\widetilde{\beta}_m) = \mathrm{sgn}(\beta^*))$. Note that
this is essentially equivalent to Lemma 1 in Wainwright (2006); a proof of this lemma is included
in Section EH for completeness. 



Lemma 3.8. Assume that the matrix $Z_S^T Z_S$ is invertible. Then for any given $\lambda_m > 0$ and noise
vector $\omega \in \mathbb{R}^m$, $\mathcal{E}(\mathrm{sgn}(\widetilde{\beta}_m) = \mathrm{sgn}(\beta^*))$ holds if and only if the following two conditions hold:

$$\left|Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1}\left[\frac{1}{m}Z_S^T\omega - \lambda_m\,\mathrm{sgn}(\beta_S^*)\right] - \frac{1}{m}Z_{S^c}^T\omega\right| \le \lambda_m, \tag{3.25a}$$

$$\mathrm{sgn}\left(\beta_S^* + \left(\frac{1}{m}Z_S^T Z_S\right)^{-1}\left[\frac{1}{m}Z_S^T\omega - \lambda_m\,\mathrm{sgn}(\beta_S^*)\right]\right) = \mathrm{sgn}(\beta_S^*). \tag{3.25b}$$

Let $b := \mathrm{sgn}(\beta_S^*)$ and $e_i \in \mathbb{R}^s$ be the vector with 1 in the $i$-th position, and zeros elsewhere; hence
$\|e_i\|_2 = 1$. Our proof of Theorem 3.4 follows that of Wainwright (2006). We first define a set of
random variables that are relevant to (3.25a) and (3.25b):



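$$\forall j \in S^c, \quad V_j := Z_j^T\left\{Z_S(Z_S^T Z_S)^{-1}\lambda_m b + \left[I_{m\times m} - Z_S(Z_S^T Z_S)^{-1}Z_S^T\right]\frac{\omega}{m}\right\}, \tag{3.26a}$$

$$\forall i \in S, \quad U_i := e_i^T\left(\frac{1}{m}Z_S^T Z_S\right)^{-1}\left[\frac{1}{m}Z_S^T\omega - \lambda_m b\right]. \tag{3.26b}$$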



Condition (3.25a) holds if and only if the event

$$\mathcal{E}(V) := \left\{\max_{j \in S^c} |V_j| \le \lambda_m\right\} \tag{3.27}$$

holds. For Condition (3.25b), the event

$$\mathcal{E}(U) := \left\{\max_{i \in S} |U_i| \le \rho_m\right\}, \tag{3.28}$$

where $\rho_m := \min_{i \in S}|\beta_i^*|$, is sufficient to guarantee that Condition (3.25b) holds.



Now, in the proof of Theorem 3.4 we assume that $\Phi$ has been fixed, and $Z = \Phi X$ and $\Phi\Phi^T$
behave nicely, in accordance with the results of Section 3.B. Let $R = \Phi\Phi^T - I_{m\times m}$ as defined
in Theorem 3.7. From here on, we use $(|r_{i,j}|)$ to denote a fixed symmetric matrix with diagonal
entries that are $\sqrt{16\log n/n}$ and off-diagonal entries that are $\sqrt{2\log n/n}$.

We now prove that $P(\mathcal{E}(V))$ and $P(\mathcal{E}(U))$ both converge to one. We begin by stating two technical
lemmas that will be required.



Lemma 3.9. (Gaussian Comparison) For any Gaussian random vector $(X_1, \ldots, X_n)$,

$$E\left(\max_{1\le i\le n}|X_i|\right) \le 3\sqrt{\log n}\,\max_{1\le i\le n}\sqrt{E\left(X_i^2\right)}. \tag{3.29}$$
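As a sanity check on the form of the Gaussian comparison bound, the sketch below estimates $E(\max_i |X_i|)$ by simulation for two hypothetical covariance matrices and compares it with $3\sqrt{\log n}\,\max_i\sqrt{E(X_i^2)}$. The covariances, sample sizes and function name are assumptions made for illustration only.

```python
import numpy as np

def gaussian_max_check(cov, trials=20000, seed=0):
    """Return (Monte Carlo estimate of E max|X_i|, right-hand side of (3.29))."""
    rng = np.random.default_rng(seed)
    n = cov.shape[0]
    samples = rng.multivariate_normal(np.zeros(n), cov, size=trials)
    lhs = np.abs(samples).max(axis=1).mean()
    rhs = 3 * np.sqrt(np.log(n)) * np.sqrt(np.diag(cov).max())
    return lhs, rhs

# Independent coordinates (hypothetical example).
lhs_iid, rhs_iid = gaussian_max_check(np.eye(50))

# Equicorrelated coordinates with unit variances (hypothetical example).
cov_corr = 0.5 * np.eye(50) + 0.5 * np.ones((50, 50))
lhs_corr, rhs_corr = gaussian_max_check(cov_corr)

print(lhs_iid, rhs_iid)
print(lhs_corr, rhs_corr)
```

In both cases the simulated expectation should fall well below the bound, which is consistent with the lemma being a convenient rather than sharp inequality.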



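Lemma 3.10. Suppose that $\|\frac{1}{n}X_S^T X_S - I_s\|_\infty$ is bounded away from 1 and

$$m \ge \left(\frac{16 C_1 s^2}{\eta^2} + \frac{4 C_2 s}{\eta}\right)\left(\log p + 2\log n + \log 2(s+1)\right). \tag{3.30}$$

Then

$$\frac{1}{\rho_m}\left\{\sqrt{\frac{\log s}{m}} + \lambda_m\left\|\left(\frac{1}{n}X_S^T X_S\right)^{-1}\right\|_\infty\right\} \to 0 \tag{3.31}$$

implies that

$$\frac{1}{\rho_m}\left\{\sqrt{\frac{\log s}{m}} + \lambda_m\left\|\left(\frac{1}{m}Z_S^T Z_S\right)^{-1}\right\|_\infty\right\} \to 0. \tag{3.32}$$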
Analysis of $\mathcal{E}(V)$. Note that for each $V_j$, for $j \in S^c$,

$$\mu_j = E(V_j) = \lambda_m Z_j^T Z_S (Z_S^T Z_S)^{-1} b.$$

By Proposition 3.6, we have that

$$\mu_j \le \lambda_m\left\|Z_{S^c}^T Z_S (Z_S^T Z_S)^{-1}\right\|_\infty \le (1 - \eta/2)\lambda_m, \quad \forall j \in S^c.$$



Let us define 



Vj = Zj | 7 mxra — Z5(ZjZ5) *Zj J —J 



from which we obtain 

max I V/ < A m 

Hence we need to show that 



ZyZs {Z T s Zs) + max | Vj \ < X m (1 — fj/2) + max | Vj 



oo jeS c 



It is sufficient to show P (max ;e 5c | V} > 77/2) — > 0. 

By Markov's inequality and the Gaussian comparison lemma [3T9l we obtain that 



maxV, > rj/2) < \ / ' J > < V BKF -maxjE(V ; 2 ). 



E (max ;e5 r Vj) 6y/\og(p - s) 



Finally, let us use $P = Z_S(Z_S^T Z_S)^{-1}Z_S^T = P^2$ to represent the projection matrix.

Var(iO) = E(v/) 

2 

= £jZj {[(/ mxm - P) O] [(/ mxm - P) Of} Z; 

" rZj [/ MXm - P] Z 7 + ^Zj(P -PR-RP + PRP)Z 



< — \\Zjf 2 + — WR-PR-RP + PRP\\ 2 \\Zj\\ z 2 



m 



m 



( 



< ( 1 + 4(m + 2) 



21og/z + 



2 log n ^ 



m 



where | Zj L < m + by Proposition 13 .61 and 
||P - PR- RP + PRP\\ 2 < 

IIPII2 + IIPII2 ll*ll 2 + \\R\\ 2 \\Ph + \\Ph \\Rh W p h 
< 4||P|| 2 <4||(|r /J |)|| 2 <4(m + 2) 1 



21ogn 
given that $\|I - P\|_2 \le 1$ and $\|P\|_2 \le 1$ and the fact that $(|r_{i,j}|)$ is a symmetric matrix,

$$\|R\|_2 \le \|(|r_{i,j}|)\|_2 \le \|(|r_{i,j}|)\|_\infty \tag{3.41a}$$

$$\le (m-1)\sqrt{\frac{2\log n}{n}} + \sqrt{\frac{16\log n}{n}} \le (m+2)\sqrt{\frac{2\log n}{n}}. \tag{3.41b}$$
Consequently Condition (13.13b ) is sufficient to ensure 

that E(»n«w|V;|) ^ o. Thus p (£(v)) 



— > 



1 as m — > oo so long as m < ^ 2 \ogn • 

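Analysis of $\mathcal{E}(U)$. We now show that $P(\mathcal{E}(U)) \to 1$. Using the triangle inequality, we obtain the
upper bound

$$\max_{i\in S}|U_i| \le \left\|\left(\frac{1}{m}Z_S^T Z_S\right)^{-1}\frac{1}{m}Z_S^T\omega\right\|_\infty + \lambda_m\left\|\left(\frac{1}{m}Z_S^T Z_S\right)^{-1}\right\|_\infty. \tag{3.42}$$

The second $\ell_\infty$-norm is a fixed value given a deterministic $\Phi X$. Hence we focus on the first norm.
We now define, for all $i \in S$, the Gaussian random variable

$$G_i = e_i^T\left(\frac{1}{m}Z_S^T Z_S\right)^{-1}\frac{1}{m}Z_S^T\omega. \tag{3.43}$$

Given that $\epsilon \sim N(0, \sigma^2 I_{n\times n})$, we have for all $i \in S$ that

$$E(G_i) = 0, \tag{3.44a}$$

$$\mathrm{Var}(G_i) = E(G_i^2) \tag{3.44b}$$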



{(IZJZ,)- 1 ±zJ<D<D'Z, (izJZ,)- 1 } e, 

= (^s)- 1 4 + {(friz,)- 1 ffiRZs (1ZJZ,)- 1 } e(3.44f) 

We first bound the first term of (I3.44fl) . By (I3.22bl) . we have that for all z e S, 



(3.44c) 

(3.44d) 
(3.44e) 



-l 



(7* 



»7 



(m Z 5 Z s) 



-1 



4(7 ' 



2 mA min (^ZjZ 5 ) < 3m?/' 



(3.45) 



We next bound the second term of (I3.44fl) . Let M = ^f-, where C = (^Z T S Z S ) 1 and 5 = 
Z^RZg- By definition, 

ei = [e U \, e irS ] = [0, . . . , 1, 0, . . .], where e Ui = 1, e t j = 0, Vj # i. (3.46) 
Thus, for all $i \in S$,



Mi 



4 {mis)'' ^RZs (IZJZ,)- 1 } ei =2Z^i^ 

7=1 ^=1 

We next require the following fact. 

Claim 3.11. If m satisfies 0.1 2D . then for all i e S, we have max,- Mjj < (1 + ^) . 
The proof appears in Section l7?Hl Using Claim [3TTT1 we have by (|3.45l) . (13.471) that 



(3.47) 



max 

l<i<s 



By the Gaussian comparison lemma (Lemma 3.9) we have

eUbxIgA = Ed^ZjZs)" 1 ^! ) 

\l<'<i / Ml II 00/ 



< 3yiog7 max (G?) < 



1<( <5 



4cr /21og5 



77 v m 



(3.49a) 
(3.49b) 



We now apply Markov's inequality to show that $P(\mathcal{E}(U)) \to 1$ due to Condition (3.13b) in the
Theorem statement and Lemma 3.10:

1 - P (sgn (ft + (±Z T s Z s y l [±Z T s co - A w sgn($)]) = sgn($)) 

< P(max|[/j| > p m \ (3.50a) 

< p(max|Gi| + A m |(^zjz 5 ) _1 > p m ) 

\ ieS II oo J 

5 ^( E fc x|G ' l ) + "™l ( " Zsrzsr 'IL) 

> m \ n v m ooy 



(3.50b) 
(3.50c) 



-> 0. 



(3.50d) 
(3.50e) 



which completes the proof. □ 



IV. Compressed Regression is Persistent 
Persistence (Greenshtein and Ritov (2004)) is a weaker condition than sparsistency. In particular,
we drop the assumption that $E(Y|X) = \beta^T X$. Roughly speaking, persistence implies that a procedure
predicts well. Let us first review the Greenshtein-Ritov argument; we then adapt it to the
compressed case.

A. Uncompressed Persistence 

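Consider a new pair $(X, Y)$ and suppose we want to predict $Y$ from $X$. The predictive risk using
predictor $\beta^T X$ is

$$R(\beta) = E(Y - \beta^T X)^2. \tag{4.1}$$

Note that this is a well-defined quantity even though we do not assume that $E(Y|X) = \beta^T X$. It is
convenient to write the risk in the following way. Define $Q = (Y, X_1, \ldots, X_p)$ and denote $\gamma$ as

$$\gamma = (-1, \beta_1, \ldots, \beta_p)^T = (\beta_0, \beta_1, \ldots, \beta_p)^T. \tag{4.2}$$

Then we can rewrite the risk as

$$R(\beta) = \gamma^T\Sigma\gamma, \tag{4.3}$$

where $\Sigma = E(QQ^T)$. The training error is then $\widehat{R}_n(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2 = \gamma^T\widehat{\Sigma}^n\gamma$, where

$$\widehat{\Sigma}^n = \frac{1}{n}\mathbb{Q}^T\mathbb{Q} \tag{4.4}$$

and $\mathbb{Q} = (Q_1^\dagger\ Q_2^\dagger\ \cdots\ Q_n^\dagger)^T$ where $Q_i^\dagger = (Y_i, X_{1i}, \ldots, X_{pi})^T \sim Q$, $\forall i = 1, \ldots, n$, are i.i.d.
random vectors. Let

$$B_n = \{\beta : \|\beta\|_1 \le L_n\}, \quad \text{for } L_n = o\left((n/\log n)^{1/4}\right). \tag{4.5}$$

Let $\beta_*$ minimize $R(\beta)$ subject to $\beta \in B_n$:

$$\beta_* = \arg\min_{\|\beta\|_1 \le L_n} R(\beta). \tag{4.6}$$

Consider the uncompressed lasso estimator $\widehat{\beta}_n$ which minimizes $\widehat{R}_n(\beta)$ subject to $\beta \in B_n$:

$$\widehat{\beta}_n = \arg\min_{\|\beta\|_1 \le L_n} \widehat{R}_n(\beta). \tag{4.7}$$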

Assumption 1. Let $Q_j, Q_k$ denote elements of $Q$. Suppose that, for each $j$ and $k$,

$$E\left(|Z|^q\right) \le q!\,M^{q-2}s/2, \tag{4.8}$$

for every $q \ge 2$ and some constants $M$ and $s$, where $Z = Q_jQ_k - E(Q_jQ_k)$. Then, by Bernstein's
inequality,

$$P\left(\left|\widehat{\Sigma}^n_{jk} - \Sigma_{jk}\right| > \epsilon\right) \le e^{-cn\epsilon^2} \tag{4.9}$$
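for some $c > 0$. Hence, if $p_n \le e^{n^\xi}$ for some $0 \le \xi < 1$ then

$$P\left(\max_{j,k}\left|\widehat{\Sigma}^n_{jk} - \Sigma_{jk}\right| > \epsilon\right) \le p_n^2 e^{-cn\epsilon^2} \le e^{-cn\epsilon^2/2}. \tag{4.10}$$

Hence, if $\epsilon_n = \sqrt{\frac{2\log n}{cn}}$, then

$$P\left(\max_{j,k}\left|\widehat{\Sigma}^n_{jk} - \Sigma_{jk}\right| > \epsilon_n\right) \to 0. \tag{4.11}$$

Thus,

$$\max_{j,k}\left|\widehat{\Sigma}^n_{jk} - \Sigma_{jk}\right| = O_P\left(\sqrt{\frac{\log n}{n}}\right). \tag{4.12}$$

Then,

$$\sup_{\beta\in B_n}\left|R(\beta) - \widehat{R}_n(\beta)\right| = \sup_{\beta\in B_n}\left|\gamma^T(\Sigma - \widehat{\Sigma}^n)\gamma\right| \le (L_n + 1)^2\max_{j,k}\left|\widehat{\Sigma}^n_{jk} - \Sigma_{jk}\right|. \tag{4.13}$$

Hence, given a sequence of sets of estimators $B_n$,

$$\sup_{\beta\in B_n}\left|R(\beta) - \widehat{R}_n(\beta)\right| = o_P(1) \quad \text{for } L_n = o\left((n/\log n)^{1/4}\right). \tag{4.14}$$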



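We claim that under Assumption 1, the sequence of uncompressed lasso procedures as given
in (4.7) is persistent, i.e., $R(\widehat{\beta}_n) - R(\beta_*) \xrightarrow{P} 0$. By the definition of $\beta_* \in B_n$ and $\widehat{\beta}_n \in B_n$,
we immediately have $R(\beta_*) \le R(\widehat{\beta}_n)$ and $\widehat{R}_n(\widehat{\beta}_n) \le \widehat{R}_n(\beta_*)$; combining with the following
inequalities,

$$R(\widehat{\beta}_n) - \widehat{R}_n(\widehat{\beta}_n) \le \sup_{\beta\in B_n}\left|R(\beta) - \widehat{R}_n(\beta)\right|, \tag{4.15}$$

$$\widehat{R}_n(\beta_*) - R(\beta_*) \le \sup_{\beta\in B_n}\left|R(\beta) - \widehat{R}_n(\beta)\right|, \tag{4.16}$$

we thus obtain

$$\left|R(\widehat{\beta}_n) - R(\beta_*)\right| \le 2\sup_{\beta\in B_n}\left|R(\beta) - \widehat{R}_n(\beta)\right|. \tag{4.17}$$

For every $\epsilon > 0$, the event $\{|R(\widehat{\beta}_n) - R(\beta_*)| > \epsilon\}$ is contained in the event

$$\left\{\sup_{\beta\in B_n}\left|R(\beta) - \widehat{R}_n(\beta)\right| > \epsilon/2\right\}. \tag{4.18}$$

Thus, for $L_n = o((n/\log n)^{1/4})$, and for all $\epsilon > 0$,

$$P\left(\left|R(\widehat{\beta}_n) - R(\beta_*)\right| > \epsilon\right) \le P\left(\sup_{\beta\in B_n}\left|R(\beta) - \widehat{R}_n(\beta)\right| > \epsilon/2\right) \to 0, \quad \text{as } n \to \infty. \tag{4.19}$$

The claim follows from the definition of persistence.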
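The key deterministic step in the argument above is the quadratic-form bound (4.13): for any $\beta$ with $\|\beta\|_1 \le L$ and $\gamma = (-1, \beta)$, $|\gamma^T(\Sigma - \widehat{\Sigma}^n)\gamma| \le (L+1)^2\max_{j,k}|\widehat{\Sigma}^n_{jk} - \Sigma_{jk}|$. The sketch below checks this on randomly drawn vectors; the population matrix, the sample size and the radius are hypothetical values chosen only for illustration.

```python
import numpy as np

def risk(gamma, S):
    """Quadratic form gamma^T S gamma, as in (4.3)."""
    return gamma @ S @ gamma

rng = np.random.default_rng(0)
p, n, L = 5, 400, 2.0

# Hypothetical population second-moment matrix of Q = (Y, X_1, ..., X_p).
A = rng.standard_normal((p + 1, p + 1))
Sigma = A @ A.T / (p + 1)

# Empirical counterpart (4.4) from n i.i.d. rows.
Qdata = rng.multivariate_normal(np.zeros(p + 1), Sigma, size=n)
Sigma_hat = Qdata.T @ Qdata / n
max_dev = np.abs(Sigma_hat - Sigma).max()

# Largest risk discrepancy over random points of the l1 ball of radius L.
worst = 0.0
for _ in range(1000):
    beta = rng.standard_normal(p)
    beta *= L * rng.uniform() / np.abs(beta).sum()    # ensures ||beta||_1 <= L
    gamma = np.concatenate(([-1.0], beta))             # as in (4.2)
    worst = max(worst, abs(risk(gamma, Sigma) - risk(gamma, Sigma_hat)))

print(worst, (L + 1) ** 2 * max_dev)
```

Random sampling only explores part of the ball, so `worst` is a lower estimate of the supremum; the inequality itself holds for every $\beta$ in the ball because $\|\gamma\|_1 \le L + 1$.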
B. Compressed Persistence 

Now we turn to the compressed case. Again we want to predict $(X, Y)$, but now the estimator $\widehat{\beta}_{n,m}$
is based on the lasso from the compressed data of dimension $m_n$; we omit the subscript $n$ from $m_n$
wherever we put $\{n, m\}$ together.

Let $\gamma$ be as in (4.2) and

$$\widehat{\Sigma}^{n,m} = \frac{1}{m_n}\mathbb{Q}^T\Phi^T\Phi\,\mathbb{Q}. \tag{4.20}$$

Let us replace $\widehat{R}_n$ with

$$\widehat{R}_{n,m}(\beta) = \gamma^T\widehat{\Sigma}^{n,m}\gamma. \tag{4.21}$$

Given compressed dimension $m_n$, the original design matrix dimension $n$ and $p_n$, let

$$B_{n,m} = \{\beta : \|\beta\|_1 \le L_{n,m}\}, \quad \text{for } L_{n,m} = o\left(\left(\frac{m_n}{\log(np_n)}\right)^{1/4}\right). \tag{4.22}$$

Let $\beta_*$ minimize $R(\beta)$ subject to $\beta \in B_{n,m}$:

$$\beta_* = \arg\min_{\beta:\,\|\beta\|_1 \le L_{n,m}} R(\beta). \tag{4.23}$$

Consider the compressed lasso estimator $\widehat{\beta}_{n,m}$ which minimizes $\widehat{R}_{n,m}(\beta)$ subject to $\beta \in B_{n,m}$:

$$\widehat{\beta}_{n,m} = \arg\min_{\beta:\,\|\beta\|_1 \le L_{n,m}} \widehat{R}_{n,m}(\beta). \tag{4.24}$$

Assumption 2. Let $Q_j$ denote the $j$-th element of $Q$. There exists a constant $M_1 > 0$ such that

$$E(Q_j^2) \le M_1, \quad \forall j \in \{1, \ldots, p_n + 1\}. \tag{4.25}$$

Theorem 4.1. Under Assumptions 1 and 2, given a sequence of sets of estimators $B_{n,m} \subset \mathbb{R}^p$ for
$\log^2(np_n) \le m_n \le n$, where $B_{n,m}$ consists of all coefficient vectors $\beta$ such that $\|\beta\|_1 \le L_{n,m} =
o((m_n/\log(np_n))^{1/4})$, the sequence of compressed lasso procedures as in (4.24) is persistent:

$$R(\widehat{\beta}_{n,m}) - R(\beta_*) \xrightarrow{P} 0, \tag{4.26}$$

when $p_n = O(e^{n^c})$ for some $c < 1/2$.
Proof. First note that

$$E\left(\widehat{\Sigma}^{n,m}\right) = \frac{1}{m_n}E\left(\mathbb{Q}^T E\left(\Phi^T\Phi\right)\mathbb{Q}\right) = \frac{1}{m_n}E\left(\frac{m_n}{n}\mathbb{Q}^T\mathbb{Q}\right) = \Sigma. \tag{4.27}$$
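The unbiasedness step rests on $E(\Phi^T\Phi) = \frac{m}{n}I_n$ when the entries of $\Phi$ are independent $N(0, 1/n)$. The following sketch, under that assumption, averages $\frac{1}{m}\mathbb{Q}^T\Phi^T\Phi\mathbb{Q}$ over many independent draws of $\Phi$ for one fixed data matrix and compares the average with $\frac{1}{n}\mathbb{Q}^T\mathbb{Q}$. All sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, trials = 50, 10, 3, 4000       # d stands for p + 1 columns (Y and X)

# One fixed data matrix whose rows play the role of (Y_i, X_i).
Q = rng.standard_normal((n, d))

# Independent compression matrices with N(0, 1/n) entries, stacked on axis 0.
Phi = rng.standard_normal((trials, m, n)) / np.sqrt(n)
Z = Phi @ Q                             # compressed data, shape (trials, m, d)

# Average of (1/m) Q^T Phi^T Phi Q over the draws of Phi, as in (4.20).
Sigma_nm_mean = np.einsum('tmj,tmk->jk', Z, Z) / (m * trials)
Sigma_n = Q.T @ Q / n                   # uncompressed counterpart (4.4)

gap = np.abs(Sigma_nm_mean - Sigma_n).max()
print(gap)
```

The gap should shrink toward zero as the number of draws grows; a single draw of $\Phi$ can still deviate noticeably, which is what the concentration argument in the remainder of the proof controls.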



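We have that

$$\sup_{\beta\in B_{n,m}}\left|R(\beta) - \widehat{R}_{n,m}(\beta)\right| = \sup_{\beta\in B_{n,m}}\left|\gamma^T(\Sigma - \widehat{\Sigma}^{n,m})\gamma\right| \le (L_{n,m} + 1)^2\max_{j,k}\left|\widehat{\Sigma}^{n,m}_{jk} - \Sigma_{jk}\right|. \tag{4.28}$$

We claim that, given $p_n = O(e^{n^c})$ with $c < 1/2$ chosen so that $\log^2(np_n) \le m_n \le n$ holds, then

$$\max_{j,k}\left|\widehat{\Sigma}^{n,m}_{jk} - \Sigma_{jk}\right| = O_P\left(\sqrt{\frac{\log(np_n)}{m_n}}\right), \tag{4.29}$$

where $\Sigma = \frac{1}{n}E(\mathbb{Q}^T\mathbb{Q})$ is the same as in (4.3), but (4.20) defines the matrix $\widehat{\Sigma}^{n,m}$.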



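Hence, given $p_n = O(e^{n^c})$ for some $c < 1/2$, combining (4.28) and (4.29), we have for $L_{n,m} =
o((m_n/\log(np_n))^{1/4})$ and $n \ge m_n \ge \log^2(np_n)$,

$$\sup_{\beta\in B_{n,m}}\left|R(\beta) - \widehat{R}_{n,m}(\beta)\right| = o_P(1). \tag{4.30}$$

By the definition of $\beta_* \in B_{n,m}$ as in (4.23) and $\widehat{\beta}_{n,m} \in B_{n,m}$, we immediately have

$$\left|R(\widehat{\beta}_{n,m}) - R(\beta_*)\right| \le 2\sup_{\beta\in B_{n,m}}\left|R(\beta) - \widehat{R}_{n,m}(\beta)\right|, \tag{4.31}$$

given that

$$R(\beta_*) \le R(\widehat{\beta}_{n,m}) \le \widehat{R}_{n,m}(\widehat{\beta}_{n,m}) + \sup_{\beta\in B_{n,m}}\left|R(\beta) - \widehat{R}_{n,m}(\beta)\right| \tag{4.32a}$$

$$\le \widehat{R}_{n,m}(\beta_*) + \sup_{\beta\in B_{n,m}}\left|R(\beta) - \widehat{R}_{n,m}(\beta)\right| \tag{4.32b}$$

$$\le R(\beta_*) + 2\sup_{\beta\in B_{n,m}}\left|R(\beta) - \widehat{R}_{n,m}(\beta)\right|. \tag{4.32c}$$

Thus for every $\epsilon > 0$, the event $\{|R(\widehat{\beta}_{n,m}) - R(\beta_*)| > \epsilon\}$ is contained in the event

$$\left\{\sup_{\beta\in B_{n,m}}\left|R(\beta) - \widehat{R}_{n,m}(\beta)\right| > \epsilon/2\right\}. \tag{4.33}$$

It follows that $\forall\epsilon > 0$, given $p_n = O(e^{n^c})$ for some $c < 1/2$, $n \ge m_n \ge \log^2(np_n)$, and
$L_{n,m} = o((m_n/\log(np_n))^{1/4})$,

$$P\left(\left|R(\widehat{\beta}_{n,m}) - R(\beta_*)\right| > \epsilon\right) \le P\left(\sup_{\beta\in B_{n,m}}\left|R(\beta) - \widehat{R}_{n,m}(\beta)\right| > \epsilon/2\right) \to 0, \quad \text{as } n \to \infty. \tag{4.34}$$

Therefore, $R(\widehat{\beta}_{n,m}) - R(\beta_*) \xrightarrow{P} 0$. The theorem follows from the definition of persistence.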

It remains to show (4.29). We first show the following claim; note that $p_n = O(e^{n^c})$ with
$c < 1/2$ clearly satisfies the condition.

Claim 4.2. Let $C = 2M_1$. Then $P\left(\max_j \|Q_j\|_2^2 > Cn\right) < \frac{1}{n}$ so long as $p_n \le \frac{e^{c_1M_1^2n}}{n}$ for some
chosen constant $c_1$ and $M_1$ satisfying Assumption 2.

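Proof. To see this, let $A = (A_1, \ldots, A_n)^T$ denote a generic column vector of $\mathbb{Q}$. Let $\mu =
E(A_i^2)$. Under our assumptions, there exists $c_1 > 0$ such that

$$P\left(\frac{1}{n}\sum_{i=1}^n V_i > t\right) \le e^{-nc_1t^2}, \tag{4.35}$$

where $V_i = A_i^2 - \mu$. We have $C = 2M_1 \ge \mu + \sqrt{\frac{\log(np_n)}{c_1n}}$ so long as $p_n \le \frac{e^{c_1M_1^2n}}{n}$. Then

$$P\left(\sum_i A_i^2 > Cn\right) \le P\left(\sum_i\left(A_i^2 - \mu\right) > n\sqrt{\frac{\log(np_n)}{c_1n}}\right) \tag{4.36a}$$

$$= P\left(\frac{1}{n}\sum_i V_i > \sqrt{\frac{\log(np_n)}{c_1n}}\right) < \frac{1}{np_n}. \tag{4.36b}$$

We have with probability $1 - 1/n$, that

$$\|Q_j\|_2^2 \le 2M_1n, \quad \forall j = 1, \ldots, p_n + 1. \tag{4.37}$$

The claim follows by the union bound for $C = 2M_1$. □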

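Thus we assume that $\|Q_j\|_2^2 \le Cn$ for all $j$, and use the triangle inequality to bound

$$\max_{jk}\left|\widehat{\Sigma}^{n,m}_{jk} - \Sigma_{jk}\right| \le \max_{jk}\left|\widehat{\Sigma}^{n,m}_{jk} - \left(\tfrac{1}{n}\mathbb{Q}^T\mathbb{Q}\right)_{jk}\right| + \max_{jk}\left|\left(\tfrac{1}{n}\mathbb{Q}^T\mathbb{Q}\right)_{jk} - \Sigma_{jk}\right|, \tag{4.38}$$

where, using $p$ as a shorthand for $p_n$:

$$\widehat{\Sigma}^{n,m} = \frac{1}{m_n}\begin{bmatrix}\|\Phi Y\|_2^2 & \langle\Phi Y, \Phi X_1\rangle & \cdots & \langle\Phi Y, \Phi X_p\rangle\\ \langle\Phi X_1, \Phi Y\rangle & \|\Phi X_1\|_2^2 & \cdots & \langle\Phi X_1, \Phi X_p\rangle\\ \vdots & \vdots & \ddots & \vdots\\ \langle\Phi X_p, \Phi Y\rangle & \langle\Phi X_p, \Phi X_1\rangle & \cdots & \|\Phi X_p\|_2^2\end{bmatrix}_{(p+1)\times(p+1)}, \tag{4.39a}$$

$$\frac{1}{n}\mathbb{Q}^T\mathbb{Q} = \frac{1}{n}\begin{bmatrix}\|Y\|_2^2 & \langle Y, X_1\rangle & \cdots & \langle Y, X_p\rangle\\ \langle X_1, Y\rangle & \|X_1\|_2^2 & \cdots & \langle X_1, X_p\rangle\\ \vdots & \vdots & \ddots & \vdots\\ \langle X_p, Y\rangle & \langle X_p, X_1\rangle & \cdots & \|X_p\|_2^2\end{bmatrix}_{(p+1)\times(p+1)}. \tag{4.39b}$$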



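We first compare each entry of $\widehat{\Sigma}^{n,m}_{jk}$ with that of $\frac{1}{n}\left(\mathbb{Q}^T\mathbb{Q}\right)_{jk}$.

Claim 4.3. Assume that $\|Q_j\|_2^2 \le Cn = 2M_1n$, $\forall j$. By taking $\epsilon = C\sqrt{\frac{8C_1\log(np_n)}{m_n}}$,

$$P\left(\max_{j,k}\left|\frac{1}{m_n}\langle\Phi Q_j, \Phi Q_k\rangle - \frac{1}{n}\langle Q_j, Q_k\rangle\right| \ge \frac{\epsilon}{2}\right) \le \frac{1}{n^2}, \tag{4.40}$$

where $C_1 = \frac{4e}{\sqrt{6\pi}} \approx 2.5044$ as in Lemma 3.5 and $C$ is defined in Claim 4.2.

Proof. Following arguments that appear before (7.41a), and by Lemma 3.5, it is straightforward
to verify:

$$P\left(\left|\frac{1}{m_n}\langle\Phi Q_j, \Phi Q_k\rangle - \frac{1}{n}\langle Q_j, Q_k\rangle\right| \ge \epsilon\right) \le 2\exp\left(\frac{-m_n\epsilon^2}{C_1C^2 + C_2C\epsilon}\right), \tag{4.41}$$

where $C_2 = \sqrt{8}\,e \approx 7.6885$ as in Lemma 3.5. There are at most $\frac{(p_n+1)(p_n+2)}{2}$ unique events given that
both matrices are symmetric; the claim follows by the union bound. □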

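We have by the union bound and (4.10), (4.38), Claim 4.2 and Claim 4.3,

$$P\left(\max_{jk}\left|\widehat{\Sigma}^{n,m}_{jk} - \Sigma_{jk}\right| > \epsilon\right) \le \tag{4.42a}$$

$$P\left(\max_{jk}\left|\tfrac{1}{n}\left(\mathbb{Q}^T\mathbb{Q}\right)_{jk} - \Sigma_{jk}\right| > \frac{\epsilon}{2}\right) + P\left(\max_j\|Q_j\|_2^2 > Cn\right) + \tag{4.42b}$$

$$P\left(\max_{j,k}\left|\frac{1}{m_n}\langle\Phi Q_j, \Phi Q_k\rangle - \frac{1}{n}\langle Q_j, Q_k\rangle\right| \ge \frac{\epsilon}{2}\ \Big|\ \max_j\|Q_j\|_2^2 \le Cn\right) \tag{4.42c}$$

$$\le e^{-cn\epsilon^2/8} + \frac{1}{n} + \frac{1}{n^2}. \tag{4.42d}$$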

Hence, given p n = O (e nC ) with c < 1/2, by taking 



we have 



P (max 2"i m - S/t > e ) < - -> 0, (4.44) 
V J k J ) n 

which completes the proof of the theorem. □ 
Remark 4.4. The main difference between the sequence of compressed lasso estimators and the
original uncompressed sequence is that $n$ and $m_n$ together define the sequence of estimators for the
compressed data. Here $m_n$ is allowed to grow from $\Omega(\log^2(np_n))$ to $n$; hence for each fixed $n$,

$$\left\{\widehat{\beta}_{n,m},\ \forall m_n \text{ such that } \log^2(np_n) < m_n \le n\right\} \tag{4.45}$$

defines a subsequence of estimators. In Section 6 we run simulations that compare the empirical
risk to the oracle risk on such a subsequence for a fixed $n$, to illustrate the compressed lasso
persistency property.



V. Information Theoretic Analysis of Privacy 

In this section we derive bounds on the rate at which the compressed data $\widetilde{X}$ reveal information
about the uncompressed data $X$. Our general approach is to consider the mapping $X \mapsto \Phi X + \Delta$
as a noisy communication channel, where the channel is characterized by multiplicative noise $\Phi$
and additive noise $\Delta$. Since the number of symbols in $X$ is $np$ we normalize by this effective block
length to define the information rate $r_{n,m}$ per symbol as

$$r_{n,m} = \sup_{p(X)}\frac{I(X; \widetilde{X})}{np}. \tag{5.1}$$

Thus, we seek bounds on the capacity of this channel, where several independent blocks are coded.
A privacy guarantee is given in terms of bounds on the rate $r_{n,m} \to 0$ decaying to zero. Intuitively,
if $I(X; \widetilde{X}) = H(X) - H(X\,|\,\widetilde{X}) \approx 0$, then the compressed data $\widetilde{X}$ reveal, on average, no more
information about the original data $X$ than could be obtained from an independent sample.

Our analysis yields the rate bound $r_{n,m} = O(m/n)$. Under the lower bounds on $m$ in our sparsistency
and persistence analyses, this leads to the information rates

$$r_{n,m} = O\left(\frac{\log(np)}{n}\right)\ \text{(sparsistency)} \qquad r_{n,m} = O\left(\frac{\log^2(np)}{n}\right)\ \text{(persistence)} \tag{5.2}$$

It is important to note, however, that these bounds may not be the best possible since they are
obtained assuming knowledge of the compression matrix $\Phi$, when in fact the privacy protocol
requires that $\Phi$ and $\Delta$ are not public. Thus, it may be possible to show a faster rate of convergence
to zero. We make this simplification since the capacity of the underlying communication channel
does not have a closed form, and appears difficult to analyze in general. Conditioning on $\Phi$ yields
the familiar Gaussian channel in the case of nonzero additive noise $\Delta$.

In the following subsection we first consider the case where additive noise $\Delta$ is allowed; this is
equivalent to a multiple antenna model in a Rayleigh flat fading environment. While our sparsistency
and persistence analysis has only considered $\Delta = 0$, additive noise is expected to give
greater privacy guarantees. Thus, extending our regression analysis to this case is an important
direction for future work. In Section 5.B we consider the case where $\Delta = 0$ with a direct analysis.
This special case does not follow from analysis of the multiple antenna model.



A. Privacy Under the Multiple Antenna Channel Model 



In the multiple antenna model for wireless communication (Marzetta and Hochwald, 1999; Telatar,
1999), there are $n$ transmitter and $m$ receiver antennas in a Rayleigh flat-fading environment. The
propagation coefficients between pairs of transmitter and receiver antennas are modeled by the
matrix entries $\Phi_{ij}$; they remain constant for a coherence interval of $p$ time periods. Computing the
channel capacity over multiple intervals requires optimization of the joint density of $pn$ transmitted
signals. Marzetta and Hochwald (1999) prove that the capacity for $n > p$ is equal to the capacity
for $n = p$, and is achieved when $X$ factors as a product of a $p \times p$ isotropically distributed unitary
matrix and a $p \times n$ random matrix that is diagonal, with nonnegative entries. They also show
that as $p$ gets large, the capacity approaches the capacity obtained as if the matrix of propagation
coefficients $\Phi$ were known. Intuitively, this is because the transmitter could send several "training"
messages used to estimate $\Phi$, and then send the remaining information based on this estimate.

More formally, the channel is modeled as 

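$$Z = \Phi X + \gamma\Delta \tag{5.3}$$

where $\gamma > 0$, $\Delta_{ij} \sim N(0, 1)$, $\Phi_{ij} \sim N(0, 1/n)$ and $\frac{1}{n}\sum_{i=1}^n E[X_{ij}^2] \le P$, where the latter is a
power constraint. The compressed data are then conditionally Gaussian, with

$$E(Z\,|\,X) = 0 \tag{5.4}$$

$$E(Z_{ij}Z_{kl}\,|\,X) = \delta_{ik}\left(\gamma^2\delta_{jl} + \frac{1}{n}\sum_{t=1}^n X_{tj}X_{tl}\right). \tag{5.5}$$

Thus the conditional density $p(Z\,|\,X)$ is given by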

$$p(Z\,|\,X) = \frac{\exp\left\{-\frac{1}{2}\mathrm{tr}\left[\left(\gamma^2I_p + \frac{1}{n}X^TX\right)^{-1}Z^TZ\right]\right\}}{(2\pi)^{pm/2}\det^{m/2}\left(\gamma^2I_p + \frac{1}{n}X^TX\right)} \tag{5.6}$$

which completely determines the channel. Note that this distribution does not depend on $\Phi$, and
the transmitted signal affects only the variance of the received signal.

The channel capacity is difficult to compute or accurately bound in full generality. However,
an upper bound is obtained by assuming that the multiplicative coefficients $\Phi$ are known to the
receiver. In this case, we have that $p(Z, \Phi\,|\,X) = p(\Phi)\,p(Z\,|\,\Phi, X)$, and the mutual information
$I(Z, \Phi; X)$ is given by

$$I(Z, \Phi; X) = E\left[\log\frac{p(Z, \Phi\,|\,X)}{p(Z, \Phi)}\right] \tag{5.7a}$$

$$= E\left[\log\frac{p(Z\,|\,X, \Phi)}{p(Z\,|\,\Phi)}\right] \tag{5.7b}$$

$$= E\left[E\left[\log\frac{p(Z\,|\,X, \Phi)}{p(Z\,|\,\Phi)}\ \Big|\ \Phi\right]\right]. \tag{5.7c}$$



Now, conditioned on $\Phi$, the compressed data $Z = \Phi X + \gamma\Delta$ can be viewed as the output of a
standard additive noise Gaussian channel. We thus obtain the upper bound

$$\sup_{p(X)} I(Z; X) \le \sup_{p(X)} I(Z, \Phi; X) \tag{5.8a}$$

$$= E\left[\sup_{p(X)} E\left[\log\frac{p(Z\,|\,X, \Phi)}{p(Z\,|\,\Phi)}\ \Big|\ \Phi\right]\right] \tag{5.8b}$$

$$\le p\,E\left[\log\det\left(I_m + \frac{P}{\gamma^2}\Phi\Phi^T\right)\right] \tag{5.8c}$$

$$\le pm\log\left(1 + \frac{P}{\gamma^2}\right) \tag{5.8d}$$

where inequality (5.8c) comes from assuming the $p$ columns of $X$ are independent, and inequality
(5.8d) uses Jensen's inequality and concavity of $\log\det S$. Summarizing, we have shown the following
result.



Proposition 5.1. Suppose that $E[X_j^2] \le P$ and the compressed data are formed by

$$Z = \Phi X + \gamma\Delta \tag{5.9}$$

where $\Phi$ is $m \times n$ with independent entries $\Phi_{ij} \sim N(0, 1/n)$ and $\Delta$ is $m \times p$ with independent
entries $\Delta_{ij} \sim N(0, 1)$. Then the information rate $r_{n,m}$ satisfies

$$r_{n,m} = \sup_{p(X)}\frac{I(X; Z)}{np} \le \frac{m}{n}\log\left(1 + \frac{P}{\gamma^2}\right). \tag{5.10}$$
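The last step of the derivation, from the expected log-determinant to $m\log(1 + P/\gamma^2)$, is an application of Jensen's inequality together with $E(\Phi\Phi^T) = I_m$. The sketch below, with hypothetical values of $m$, $n$, $P$ and $\gamma$, estimates $E[\log\det(I_m + \frac{P}{\gamma^2}\Phi\Phi^T)]$ by simulation and compares it with the Jensen bound; it also evaluates the resulting rate bound from Proposition 5.1. The function names are illustrative.

```python
import numpy as np

def rate_bound_additive(m, n, P, gamma):
    """Right-hand side of (5.10): (m/n) log(1 + P / gamma^2)."""
    return (m / n) * np.log1p(P / gamma**2)

def mean_logdet(m, n, P, gamma, trials=2000, seed=0):
    """Monte Carlo estimate of E log det(I_m + (P/gamma^2) Phi Phi^T)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        Phi = rng.standard_normal((m, n)) / np.sqrt(n)    # N(0, 1/n) entries
        M = np.eye(m) + (P / gamma**2) * (Phi @ Phi.T)
        _, logdet = np.linalg.slogdet(M)
        total += logdet
    return total / trials

# Hypothetical setting.
m, n, P, gamma = 5, 50, 2.0, 1.0
lhs = mean_logdet(m, n, P, gamma)
rhs = m * np.log1p(P / gamma**2)          # Jensen bound per column of X
print(lhs, rhs, rate_bound_additive(m, n, P, gamma))
```

Because the bound scales as $m/n$, any choice of $m$ that grows more slowly than $n$ drives the per-symbol rate to zero.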



B. Privacy Under Multiplicative Noise 

When $\Delta = 0$, or equivalently $\gamma = 0$, the above analysis yields the trivial bound $r_{n,m} \le \infty$. Here
we derive a separate bound for this case; the resulting asymptotic order of the information rate is
the same, however.
Consider first the case where $p = 1$, so that there is a single column $X$ in the data matrix. The
entries are independently sampled as $X_i \sim F$ where $F$ has mean zero and bounded variance
$\mathrm{Var}(F) \le P$. Let $Z = \Phi X \in \mathbb{R}^m$. An upper bound on the mutual information $I(X; Z)$ again
comes from assuming the compression matrix $\Phi$ is known. In this case

$$I(Z, \Phi; X) = H(Z\,|\,\Phi) - H(Z\,|\,X, \Phi) \tag{5.11}$$

$$= H(Z\,|\,\Phi) \tag{5.12}$$

where the second conditional entropy in (5.11) is zero since $Z = \Phi X$. Now, the conditional
variance of $Z = (Z_1, \ldots, Z_m)^T$ satisfies



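$$\mathrm{Var}(Z_i\,|\,\Phi) = \sum_{j=1}^n\Phi_{ij}^2\,\mathrm{Var}\,X_j \le P\sum_{j=1}^n\Phi_{ij}^2. \tag{5.13}$$

Therefore,

$$I(Z, \Phi; X) = H(Z\,|\,\Phi) \tag{5.14a}$$

$$\le \sum_{i=1}^m H(Z_i\,|\,\Phi) \tag{5.14b}$$

$$\le \sum_{i=1}^m E\left[\frac{1}{2}\log\left(2\pi eP\sum_{j=1}^n\Phi_{ij}^2\right)\right] \tag{5.14c}$$

$$\le \sum_{i=1}^m\frac{1}{2}\log\left(2\pi eP\sum_{j=1}^n E\left(\Phi_{ij}^2\right)\right) \tag{5.14d}$$

$$= \frac{m}{2}\log(2\pi eP) \tag{5.14e}$$

where inequality (5.14b) follows from the chain rule and the fact that conditioning reduces entropy,
inequality (5.14c) is achieved by taking $F = N(0, P)$, a Gaussian, and inequality (5.14d) uses
concavity of $\log\det S$. In the case where there are $p$ columns of $X$, taking each column to be
independently sampled from a Gaussian with variance $P$ gives the upper bound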



$$I(Z, \Phi; X) \le \frac{mp}{2}\log(2\pi eP). \tag{5.15}$$



Summarizing, we have the following result. 



Proposition 5.2. Suppose that $E[X_j^2] \le P$ and the compressed data are formed by

$$Z = \Phi X \tag{5.16}$$

where $\Phi$ is $m \times n$ with independent entries $\Phi_{ij} \sim N(0, 1/n)$. Then the information rate $r_{n,m}$
satisfies

$$r_{n,m} = \sup_{p(X)}\frac{I(X; Z)}{np} \le \frac{m}{2n}\log(2\pi eP). \tag{5.17}$$
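The step from (5.14c) to (5.14e) uses the concavity of the logarithm and the fact that $\sum_j E(\Phi_{ij}^2) = 1$ for $N(0, 1/n)$ entries. A small sketch, assuming those entry variances and using hypothetical values of $n$ and $P$, compares a simulated value of $E[\frac{1}{2}\log(2\pi eP\sum_j\Phi_{ij}^2)]$ with $\frac{1}{2}\log(2\pi eP)$ and evaluates the rate bound (5.17).

```python
import numpy as np

def rate_bound_multiplicative(m, n, P):
    """Right-hand side of (5.17): (m / 2n) log(2 pi e P)."""
    return m / (2 * n) * np.log(2 * np.pi * np.e * P)

rng = np.random.default_rng(0)
n, P, trials = 20, 1.5, 20000      # hypothetical values

# Each row simulates sum_j Phi_ij^2 for one row of Phi with N(0, 1/n) entries.
row_energy = (rng.standard_normal((trials, n)) ** 2).sum(axis=1) / n

lhs = 0.5 * np.log(2 * np.pi * np.e * P * row_energy).mean()   # summand of (5.14c)
rhs = 0.5 * np.log(2 * np.pi * np.e * P)                        # summand of (5.14e)
print(lhs, rhs)
print(rate_bound_multiplicative(m=100, n=10000, P=P))
```

The gap between the two quantities narrows as $n$ grows, since the row energy $\sum_j\Phi_{ij}^2$ concentrates around one.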



VI. Experiments 

In this section we report the results of simulations designed to validate the theoretical analysis
presented in the previous sections. We first present results that indicate the compressed lasso is
comparable to the uncompressed lasso in recovering the sparsity pattern of the true linear model,
in accordance with the analysis in Section 3. We then present experimental results on persistence
that are in close agreement with the theoretical results of Section 4.



A. Sparsistency 



Here we run simulations to compare the compressed lasso with the uncompressed lasso in terms
of the probability of success in recovering the sparsity pattern of $\beta^*$. We use random matrices for
both $X$ and $\Phi$, and reproduce the experimental conditions shown in Wainwright (2006). A design
parameter is the compression factor

$$f = \frac{n}{m} \tag{6.1}$$

which indicates how much the original data are compressed. The results show that when the
compression factor $f$ is large enough, the thresholding behaviors as specified in (2.6) and (2.7)
for the uncompressed lasso carry over to the compressed lasso, when $X$ is drawn from a Gaussian
ensemble. In general, the compression factor $f$ is well below the requirement that we have in
Theorem 3.4 in case $X$ is deterministic.

In more detail, we consider the Gaussian ensemble for the projection matrix $\Phi$, where $\Phi_{i,j} \sim
N(0, 1/n)$ are independent. The noise vector is always composed of i.i.d. Gaussian random variables
$\epsilon \sim N(0, \sigma^2)$, where $\sigma^2 = 1$. We consider Gaussian ensembles for the design matrix $X$ with
both diagonal and Toeplitz covariance. In the Toeplitz case, the covariance is given by

$$T(\rho) = \begin{bmatrix}1 & \rho & \rho^2 & \cdots & \rho^{p-1}\\ \rho & 1 & \rho & \cdots & \rho^{p-2}\\ \vdots & \vdots & \ddots & \ddots & \vdots\\ \rho^{p-1} & \rho^{p-2} & \cdots & \rho & 1\end{bmatrix}_{p\times p}. \tag{6.2}$$

We use $\rho = 0.1$. Both $I$ and $T(0.1)$ satisfy conditions (7.1a), (7.1b) and (7.3) (Zhao and Yu,
2007). For $\Sigma = I$, $\theta_u = \theta_\ell = 1$, while for $\Sigma = T(0.1)$, $\theta_u \approx 1.84$ and $\theta_\ell \approx 0.46$ (Wainwright,
2006), for the uncompressed lasso in (2.6) and in (2.7).



In the following simulations, we carry out the lasso using procedure lars(Y, X) that implements
the LARS algorithm of Efron et al. (2004) to calculate the full regularization path; the parameter
$\lambda$ is then selected along this path to match the appropriate condition specified by the analysis. For
the uncompressed case, we run lars(Y, X) such that

$$Y = X\beta^* + \epsilon, \tag{6.3}$$

and for the compressed case we run lars($\Phi Y$, $\Phi X$) such that

$$\Phi Y = \Phi X\beta^* + \Phi\epsilon. \tag{6.4}$$

In each individual plot shown below, the covariance $\Sigma = \frac{1}{n}E(X^TX)$ and model $\beta^*$ are fixed across
all curves in the plot. For each curve, a compression factor $f \in \{5, 10, 20, 40, 80, 120\}$ is chosen
for the compressed lasso, and we show the probability of success for recovering the signs of $\beta^*$
as the number of compressed observations $m$ increases, where $m = 2\theta\sigma^2s\log(p - s) + s + 1$
for $\theta \in [0.1, u]$, for $u \ge 3$. Thus, the number of compressed observations is $m$, and the number
of uncompressed observations is $n = fm$. Each point on a curve, for a particular $\theta$ or $m$, is an
average over 200 trials; for each trial, we randomly draw $X_{n\times p}$, $\Phi_{m\times n}$, and $\epsilon \in \mathbb{R}^n$. However $\beta^*$
remains the same for all 200 trials, and is in fact fixed across different sets of experiments for the
same sparsity level.

We consider two sparsity regimes:

$$\text{Sublinear sparsity:}\quad s(p) = \frac{\alpha p}{\log(\alpha p)} \quad \text{for } \alpha \in \{0.1, 0.2, 0.4\} \tag{6.5a}$$

$$\text{Fractional power sparsity:}\quad s(p) = \alpha p^{\gamma} \quad \text{for } \alpha = 0.2 \text{ and } \gamma = 0.5. \tag{6.5b}$$

The coefficient vector $\beta^*$ is selected to be a prefix of a fixed vector

$$\beta^\star = (-0.9, -1.7, 1.1, 1.3, 0.9, 2, -1.7, -1.3, -0.9, -1.5, 1.3, -0.9, 1.3, 1.1, 0.9)^T. \tag{6.6}$$

That is, if $s$ is the number of nonzero coefficients, then

$$\beta^*_i = \begin{cases}\beta^\star_i & \text{if } i \le s,\\ 0 & \text{otherwise.}\end{cases} \tag{6.7}$$

As an exception, for the case $s = 2$, we set $\beta^* = (0.9, -1.7, 0, \ldots, 0)^T$.
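The data-generating steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the identity-covariance design, the function names `make_beta_star` and `draw_trial`, and the example sizes are all choices made here for illustration. The lasso fit itself (the lars procedure) is deliberately left out.

```python
import numpy as np

# Fixed vector from (6.6); beta* is a prefix of it, padded with zeros.
BETA_FIXED = np.array([-0.9, -1.7, 1.1, 1.3, 0.9, 2, -1.7, -1.3,
                       -0.9, -1.5, 1.3, -0.9, 1.3, 1.1, 0.9])

def make_beta_star(s, p):
    """Build beta* with s nonzero leading coefficients, following (6.7)."""
    beta = np.zeros(p)
    if s == 2:
        beta[:2] = [0.9, -1.7]          # exception stated in the text
    else:
        beta[:s] = BETA_FIXED[:s]
    return beta

def draw_trial(n, m, p, beta_star, sigma=1.0, rng=None):
    """Draw X, epsilon and Phi, and return the compressed pair (Phi X, Phi Y)."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, p))                   # identity covariance (assumed)
    eps = sigma * rng.standard_normal(n)
    Y = X @ beta_star + eps                           # uncompressed model (6.3)
    Phi = rng.standard_normal((m, n)) / np.sqrt(n)    # N(0, 1/n) entries
    return Phi @ X, Phi @ Y, Phi, X, eps

# Example with compression factor f = 5 (hypothetical sizes).
rng = np.random.default_rng(0)
f, m, p, s = 5, 30, 128, 3
beta_star = make_beta_star(s, p)
PhiX, PhiY, Phi, X, eps = draw_trial(f * m, m, p, beta_star, rng=rng)
```

The returned pair `(PhiX, PhiY)` is what a lasso solver would receive in the compressed case; in the uncompressed case the solver would receive `X` and `X @ beta_star + eps` instead.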
                         p = 128         p = 256         p = 512         p = 1024
                   α    s(p)   m/p      s(p)   m/p      s(p)   m/p      s(p)   m/p
Fractional Power   0.2    2    0.24       3    0.20       5    0.19       6    0.12
Sublinear          0.1    3    0.36       5    0.33       9    0.34
                   0.2    5    0.59       9    0.60      15    0.56
                   0.4    9    1.05      15    1.00

Table 1: Simulation parameters: $s(p)$ and ratio of $m/p$ for $\theta = 1$ and $\sigma^2 = 1$.
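The entries of Table 1 can be recomputed from the sparsity formulas (6.5a), (6.5b) and the sample-size rule $m = 2\theta\sigma^2 s\log(p - s) + s + 1$. In the sketch below the logarithms are taken base 2 and $s(p)$ is rounded to the nearest integer; both are assumptions inferred from the tabulated numbers, since the text does not state a base or a rounding rule. Under this reading most tabulated values are reproduced to two decimals.

```python
import numpy as np

def sublinear_s(p, alpha):
    """s(p) = alpha p / log(alpha p), rounded; base-2 log is an assumption."""
    return int(round(alpha * p / np.log2(alpha * p)))

def fractional_s(p, alpha=0.2, gamma=0.5):
    """s(p) = alpha p^gamma, rounded to the nearest integer (assumed)."""
    return int(round(alpha * p ** gamma))

def m_over_p(p, s, theta=1.0, sigma2=1.0):
    """Ratio m/p with m = 2 theta sigma^2 s log(p - s) + s + 1 (base-2 log assumed)."""
    return (2 * theta * sigma2 * s * np.log2(p - s) + s + 1) / p

for p in (128, 256, 512, 1024):
    s = fractional_s(p)
    print("fractional", p, s, round(m_over_p(p, s), 2))

for alpha in (0.1, 0.2, 0.4):
    for p in (128, 256, 512):
        s = sublinear_s(p, alpha)
        print("sublinear", alpha, p, s, round(m_over_p(p, s), 2))
```

The loops print more combinations than Table 1 lists; only the cells that appear in the table are meant to be compared.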



After each trial, lars(Y, X) outputs a "regularization path," which is a set of estimated models
$\mathcal{P}_m = \{\beta\}$ such that each $\beta \in \mathcal{P}_m$ is associated with a corresponding regularization parameter
$\lambda(\beta)$, which is computed as




(6.8) 



The coefficient vector $\widetilde{\beta} \in \mathcal{P}_m$ for which $\lambda(\widetilde{\beta})$ is closest to the value $\lambda_m$ is then evaluated for sign
consistency, where

$$\lambda_m = c\sqrt{\frac{\log(p - s)\log s}{m}}. \tag{6.9}$$

If $\mathrm{sgn}(\widetilde{\beta}) = \mathrm{sgn}(\beta^*)$, the trial is considered a success; otherwise, it is a failure. We allow the
constant $c$ that scales $\lambda_m$ to change with the experimental configuration (covariance $\Sigma$, compression
factor $f$, dimension $p$ and sparsity $s$), but $c$ is a fixed constant across all $m$ along the same curve.

Table 1 summarizes the parameter settings that the simulations evaluate. In this table the ratio
$m/p$ is for $m$ evaluated at $\theta = 1$. The plots in Figures 1-4 show the empirical probability of the
event $\mathcal{E}(\mathrm{sgn}(\widetilde{\beta}) = \mathrm{sgn}(\beta^*))$ for each of these settings, which is a lower bound for that of the event
$\{\mathrm{supp}(\widetilde{\beta}) = \mathrm{supp}(\beta^*)\}$. The figures clearly demonstrate that the compressed lasso recovers the
true sparsity pattern as well as the uncompressed lasso.
Figure 1: Plots of the number of samples versus the probability of success. The four sets of
curves on the left panel map to $p = 128$, 256, 512 and 1024, with dashed lines marking $m =
2\theta s\log(p - s) + s + 1$ for $\theta = 1$ and $s = 2$, 3, 5 and 6 respectively. For clarity, the left plots only
show the uncompressed lasso and the compressed lasso with $f = 120$.
[Figure: horizontal axis is the control parameter $\theta$, ranging from 0.0 to 3.0.]



Figure 2: Plots of the number of samples versus the probability of success. The three sets of curves
on the left panel map to $p = 128$, 256 and 512 with dashed lines marking $m = 2\theta s\log(p - s) + s + 1$
for $\theta = 1$ and $s = 3$, 5 and 9 respectively.
[Figure panel title: Identity; Sublinear, $\alpha = 0.2$; $p = 128$]

Figure 3: Plots of the number of samples versus the probability of success. The three sets of
curves on the left panel map to $p = 128$, 256 and 512, with vertical dashed lines marking $m =
2\theta s\log(p - s) + s + 1$ for $\theta = 1$, and $s = 5$, 9 and 15 respectively.
Figure 4: Plots of the number of samples versus the probability of success. The two sets of
curves on the left panel correspond to $p = 128$ and 256, with vertical dashed lines mapping to
$m = 2\theta s\log(p - s) + s + 1$ for $\theta = 1$, and $s = 9$ and 15 respectively.

B. Persistence



We now study the behavior of predictive and empirical risks under compression. In this section,
we refer to lasso2(Y ~ X, L) as the code that solves the following $\ell_1$-constrained optimization
problem directly, based on algorithms described by Osborne et al. (2000):

$$(P_3)\qquad \widehat{\beta} = \arg\min\|Y - X\beta\|_2 \tag{6.10a}$$

$$\text{such that } \|\beta\|_1 \le L. \tag{6.10b}$$

Let us first define the following $\ell_1$-balls $B_n$ and $B_{n,m}$ for a fixed uncompressed sample size $n$ and
dimension $p_n$, and a varying compressed sample size $m$. By Greenshtein and Ritov (2004), given
a sequence of sets of estimators

$$B_n = \{\beta : \|\beta\|_1 \le L_n\}, \quad \text{where } L_n = \frac{n^{1/4}}{\sqrt{\log n}}, \tag{6.11}$$

the uncompressed Lasso estimator $\widehat{\beta}_n$ as in (4.7) is persistent over $B_n$. Given $n$, $p_n$, Theorem 4.1
shows that, given a sequence of sets of estimators

$$B_{n,m} = \{\beta : \|\beta\|_1 \le L_{n,m}\}, \quad \text{where } L_{n,m} = \frac{m^{1/4}}{\sqrt{\log(np_n)}}, \tag{6.12}$$

for $\log^2(np_n) \le m \le n$, the compressed Lasso estimator $\widehat{\beta}_{n,m}$ as in (4.24) is persistent over $B_{n,m}$.

We use simulations to illustrate how close the compressed empirical risk computed through (6.21)
is to that of the best compressed predictor $\beta_*$ as in (4.23) for a given set $B_{n,m}$, the size of which
depends on the data dimension $n$, $p_n$ of an uncompressed design matrix $X$, and the compressed
dimension $m$; we also illustrate how close these two types of risks are to that of the best uncompressed
predictor defined in (4.6) for a given set $B_n$ for all $\log np_n \le m \le n$.

We let the row vectors of the design matrix be independent identical copies of a random vector
$X \sim N(0, \Sigma)$. For simplicity, we generate $Y = X^T\beta^* + \epsilon$, where $X$ and $\beta^* \in \mathbb{R}^p$, $E(\epsilon) = 0$ and
$E(\epsilon^2) = \sigma^2$; note that $E(Y \mid X) = X^T\beta^*$, although the persistence model need not assume this.
Note that for all $m \le n$,

$$L_{n,m} = \frac{m^{1/4}}{\sqrt{\log(n p_n)}} < L_n. \tag{6.13}$$

Hence the risk of the model constructed on the compressed data over $B_{n,m}$ is necessarily no smaller
than the risk of the model constructed on the uncompressed data over $B_n$, for all $m \le n$.

For n = 9000 and p = 128, we set s(p) = 3 and 9 respectively, following the sublinear
sparsity (6.5a) with a = 0.2 and 0.4; correspondingly, two sets of coefficients are chosen for $\beta^*$:

$$\beta^*_a = (-0.9, 1.1, 0.687, 0, \ldots, 0)^T \tag{6.14}$$

so that $\|\beta^*_a\|_1 < L_n$ and $\beta^*_a \in B_n$, and

$$\beta^*_b = (-0.9, -1.7, 1.1, 1.3, -0.5, 2, -1.7, -1.3, -0.9, 0, \ldots, 0)^T \tag{6.15}$$

so that $\|\beta^*_b\|_1 > L_n$ and $\beta^*_b \notin B_n$.
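The radii in (6.11) and (6.12) and the membership claims for the two coefficient vectors can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the use of base-2 logarithms is an assumption, chosen because it reproduces the value $L_n = 2.6874$ quoted for n = 9000 in the caption of Figure 5.

```python
import math

def l1_radius_uncompressed(n):
    # L_n = n^(1/4) / sqrt(log n); the base-2 logarithm is an assumption
    return n ** 0.25 / math.sqrt(math.log2(n))

def l1_radius_compressed(m, n, p):
    # L_{n,m} = m^(1/4) / sqrt(log(n p)), same log-base assumption
    return m ** 0.25 / math.sqrt(math.log2(n * p))

n, p = 9000, 128
L_n = l1_radius_uncompressed(n)

# Nonzero entries of the two coefficient vectors from (6.14) and (6.15)
beta_a = [-0.9, 1.1, 0.687]
beta_b = [-0.9, -1.7, 1.1, 1.3, -0.5, 2, -1.7, -1.3, -0.9]
norm_a = sum(abs(b) for b in beta_a)
norm_b = sum(abs(b) for b in beta_b)

print("L_n =", round(L_n, 4))
print("beta_a inside B_n:", norm_a < L_n)
print("beta_b outside B_n:", norm_b > L_n)
for m in (500, 2000, 9000):
    print("m =", m, "L_{n,m} =", round(l1_radius_compressed(m, n, p), 4))
```

Under this assumption the compressed radius stays below $L_n$ even at m = n, in line with (6.13).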

In order to find $\beta_*$ that minimizes the predictive risk $R(\beta) = E((Y - X^T\beta)^2)$, we first derive the
following expression for the risk. With $\Sigma = A^TA$, a simple calculation shows that

$$E(Y - X^T\beta)^2 - E(Y^2) = -\beta^{*T}\Sigma\beta^* + \|A\beta^* - A\beta\|_2^2. \tag{6.16}$$

Hence

\begin{align}
R(\beta) &= E(Y^2) - \beta^{*T}\Sigma\beta^* + \|A\beta^* - A\beta\|_2^2 \tag{6.17a} \\
&= E(Y^2) - \beta^{*T}E(XX^T)\beta^* + \|A\beta^* - A\beta\|_2^2 \tag{6.17b} \\
&= \sigma^2 + \|A\beta^* - A\beta\|_2^2. \tag{6.17c}
\end{align}

For the next two sets of simulations, we fix n = 9000 and $p_n$ = 128. To generate the uncompressed
predictive (oracle) risk curve, we let

$$\beta_{*,n} = \arg\min_{\|\beta\|_1 \le L_n} R(\beta) = \arg\min_{\|\beta\|_1 \le L_n} \|A\beta^* - A\beta\|_2^2. \tag{6.18}$$

Hence we obtain $\beta_*$ by running lasso2($\Sigma^{1/2}\beta^*$ ~ $\Sigma^{1/2}$, $L_n$). To generate the compressed
predictive (oracle) curve, for each $m$, we let

$$\beta_{*,n,m} = \arg\min_{\|\beta\|_1 \le L_{n,m}} R(\beta) = \arg\min_{\|\beta\|_1 \le L_{n,m}} \|A\beta^* - A\beta\|_2^2. \tag{6.19}$$

Hence we obtain $\beta_*$ for each $m$ by running lasso2($\Sigma^{1/2}\beta^*$ ~ $\Sigma^{1/2}$, $L_{n,m}$). We then compute the oracle
risk for both cases as

$$R(\beta) = (\beta - \beta^*)^T\Sigma(\beta - \beta^*) + \sigma^2. \tag{6.20}$$
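The equivalence between the norm form (6.17c) and the quadratic form (6.20) of the risk can be illustrated on a toy example. This is a minimal sketch: the 3 x 3 factor `A`, the candidate `beta`, and the noise variance are hypothetical values, not the Toeplitz covariance used in the simulations.

```python
def matvec(M, v):
    # Matrix-vector product for lists of rows
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def gram(A):
    # Sigma = A^T A
    rows, cols = len(A), len(A[0])
    return [[sum(A[k][i] * A[k][j] for k in range(rows)) for j in range(cols)]
            for i in range(cols)]

def predictive_risk(beta, beta_star, Sigma, sigma2):
    # R(beta) = (beta - beta*)^T Sigma (beta - beta*) + sigma^2, as in (6.20)
    diff = [b - bs for b, bs in zip(beta, beta_star)]
    Sd = matvec(Sigma, diff)
    return sum(d * w for d, w in zip(diff, Sd)) + sigma2

A = [[1.0, 0.5, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]]  # hypothetical factor
Sigma = gram(A)
beta_star = [-0.9, 1.1, 0.687]
beta = [-0.5, 1.0, 0.0]                                   # hypothetical candidate
sigma2 = 1.0

# Norm form (6.17c): sigma^2 + ||A beta* - A beta||_2^2
Ad = matvec(A, [bs - b for bs, b in zip(beta_star, beta)])
risk_via_norm = sigma2 + sum(a * a for a in Ad)
risk_via_quad = predictive_risk(beta, beta_star, Sigma, sigma2)
print(risk_via_norm, risk_via_quad)
```

At $\beta = \beta^*$ the risk reduces to $\sigma^2$, which is why the oracle risk equals the noise variance whenever $\beta^*$ lies inside the constraint set.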

For each chosen value of $m$, we compute the corresponding empirical risk, its sample mean and
sample standard deviation by averaging over 100 trials. For each trial, we randomly draw $X_{n \times p}$
with independent row vectors $x_i \sim N(0, T(0.1))$, and $Y = X\beta^* + \epsilon$. If $\hat\beta$ is the coefficient vector
returned by lasso2($\Phi Y$ ~ $\Phi X$, $L_{n,m}$), then the empirical risk is computed as

$$\hat R(\hat\beta) = \gamma^T\hat\Sigma\gamma, \quad \text{where } \hat\Sigma = \frac{1}{m}Q^T\Phi^T\Phi Q, \tag{6.21}$$

where $Q_{n \times (p+1)} = [Y, X]$ and $\gamma = (-1, \hat\beta_1, \ldots, \hat\beta_p)$.
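The quadratic form in (6.21) is the compressed mean squared residual $\frac{1}{m}\|\Phi Y - \Phi X\hat\beta\|_2^2$ written in terms of $Q = [Y, X]$. The sketch below checks this on small random data; the dimensions, the identity covariance, and the stand-in vector `beta_hat` are assumptions made for illustration, and no lasso solver is run.

```python
import random

def matmul(A, B):
    # Product of two matrices stored as lists of rows
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

rng = random.Random(0)
n, p, m = 30, 3, 10
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta_star = [-0.9, 1.1, 0.687]
Y = [sum(x * b for x, b in zip(row, beta_star)) + rng.gauss(0.0, 1.0) for row in X]

# Compression matrix with independent N(0, 1/n) entries
Phi = [[rng.gauss(0.0, (1.0 / n) ** 0.5) for _ in range(n)] for _ in range(m)]

Q = [[y] + row for y, row in zip(Y, X)]   # n x (p+1) matrix [Y, X]
PQ = matmul(Phi, Q)                       # m x (p+1) matrix Phi Q
Sigma_hat = [[v / m for v in row] for row in matmul(transpose(PQ), PQ)]

beta_hat = [-0.8, 1.0, 0.5]               # stand-in for a fitted coefficient vector
gamma = [-1.0] + beta_hat
risk_quadratic = sum(gamma[i] * Sigma_hat[i][j] * gamma[j]
                     for i in range(p + 1) for j in range(p + 1))

# Direct computation: (1/m) * ||Phi Y - Phi X beta_hat||^2
residual = [r[0] - sum(r[j + 1] * beta_hat[j] for j in range(p)) for r in PQ]
risk_direct = sum(e * e for e in residual) / m
print(risk_quadratic, risk_direct)
```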



[Plot panel title: n=9000, p=128, s=3]
Figure 5: $L_n = 2.6874$ for n = 9000. Each data point corresponds to the mean empirical risk over
100 trials, and each vertical bar shows one standard deviation. Top plot: risk versus compressed
dimension for $\beta^* = \beta^*_a$; the uncompressed oracle predictive risk is R = 1. Bottom plot: risk
versus compressed dimension for $\beta^* = \beta^*_b$; the uncompressed oracle predictive risk is R = 9.81.
VII. Proofs of Technical Results



A. Connection to the Gaussian Ensemble Result 



We first state a result which directly follows from the analysis of Theorem 3.4, and we then compare
it with the Gaussian ensemble result of Wainwright (2006) that we summarized in Section 2.

First, let us state the following slightly relaxed conditions that are imposed on the design matrix
by Wainwright (2006), and also by Zhao and Yu (2007), when $X$ is deterministic:

\begin{align}
\left\| X_{S^c}^T X_S (X_S^T X_S)^{-1} \right\|_\infty &\le 1 - \eta, \quad \text{for some } \eta \in (0, 1], \text{ and} \tag{7.1a} \\
\Lambda_{\min}\Big(\frac{1}{n} X_S^T X_S\Big) &\ge C_{\min} > 0, \tag{7.1b}
\end{align}

where $\Lambda_{\min}(A)$ is the smallest eigenvalue of $A$. In Section 7.B, Proposition 7.4 shows that
S-incoherence implies the conditions in equations (7.1a) and (7.1b).



From the proof of Theorem 3.4 it is easy to verify the following. Let $X$ be a deterministic matrix
satisfying the conditions specified in Theorem 3.4, and let all constants be the same as in Theorem 3.4.
Suppose that, before compression, we have noiseless responses $Y = X\beta^*$, and we observe, after
compression, $\tilde X = \Phi X$, and

$$\tilde Y = \Phi Y + \epsilon = \tilde X\beta^* + \epsilon, \tag{7.2}$$

where $\Phi_{m \times n}$ is a Gaussian ensemble with independent entries $\Phi_{i,j} \sim N(0, 1/n)$, $\forall i, j$, and
$\epsilon \sim N(0, \sigma^2 I_m)$. Suppose $m \ge \big(\frac{16 C_1 s^2}{\eta^2} + \frac{4 C_2 s}{\eta}\big)(\ln p + 2\log n + \log 2(s + 1))$ and $\lambda_m$
satisfies (3.13). Let $\tilde\beta_m$ be an optimal solution to the compressed lasso, given $\tilde X$, $\tilde Y$, $\epsilon$ and $\lambda_m > 0$:

$$\tilde\beta_m = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2m}\|\tilde Y - \tilde X\beta\|_2^2 + \lambda_m\|\beta\|_1. \tag{7.3}$$

Then the compressed lasso is sparsistent: $P(\mathrm{supp}(\tilde\beta_m) = \mathrm{supp}(\beta^*)) \to 1$ as $m \to \infty$. Note
that the upper bound $m \le \sqrt{n/(16\log n)}$ in (3.12) is no longer necessary, since we are handling the
random vector $\epsilon$ with i.i.d. entries rather than the non-i.i.d. $\Phi\epsilon$ as in Theorem 3.4.



We first observe that the design matrix $\tilde X = \Phi X$ as in (7.2) is exactly a Gaussian ensemble
that Wainwright (2006) analyzes. Each row of $\tilde X$ is chosen as an i.i.d. Gaussian random vector
$\sim N(0, \Sigma)$ with covariance matrix $\Sigma = \frac{1}{n}X^TX$. In the following, let $\Lambda_{\min}(\Sigma_{SS})$ be the minimum
eigenvalue of $\Sigma_{SS}$ and $\Lambda_{\max}(\Sigma)$ be the maximum eigenvalue of $\Sigma$. By imposing the S-incoherence
condition on $X_{n \times p}$, we obtain the following two conditions on the covariance matrix $\Sigma$, which are
required by Wainwright (2006) for deriving the threshold conditions (2.6) and (2.7), when the
design matrix is a Gaussian ensemble like $\tilde X$:

\begin{align}
\left\| \Sigma_{S^cS}(\Sigma_{SS})^{-1} \right\|_\infty &\le 1 - \eta, \quad \text{for } \eta \in (0, 1], \text{ and} \tag{7.4a} \\
\Lambda_{\min}(\Sigma_{SS}) &\ge C_{\min} > 0. \tag{7.4b}
\end{align}


When we apply this to $\tilde X = \Phi X$ where $\Phi$ is from the Gaussian ensemble and $X$ is deterministic,
this condition requires that

\begin{align}
\left\| X_{S^c}^T X_S (X_S^T X_S)^{-1} \right\|_\infty &\le 1 - \eta, \quad \text{for } \eta \in (0, 1], \text{ and} \tag{7.5a} \\
\Lambda_{\min}\Big(\frac{1}{n} X_S^T X_S\Big) &\ge C_{\min} > 0, \tag{7.5b}
\end{align}

since in this case $E\big(\frac{1}{m}X^T\Phi^T\Phi X\big) = \frac{1}{n}X^TX$. In addition, it is assumed in Wainwright (2006) that
there exists a constant $C_{\max}$ such that

$$\Lambda_{\max}(\Sigma) \le C_{\max}. \tag{7.6}$$

This condition need not hold for $\frac{1}{n}X^TX$. In more detail, given $\Lambda_{\max}(\frac{1}{n}X^TX) = \frac{1}{n}\Lambda_{\max}(X^TX) =
\frac{1}{n}\|X\|_2^2$, we first obtain a loose upper and lower bound for $\|X\|_2^2$ through the Frobenius norm $\|X\|_F$
of $X$. Given that $\|X_j\|_2^2 = n$, $\forall j \in \{1, \ldots, p\}$, we have $\|X\|_F^2 = \sum_{j=1}^p\sum_{i=1}^n |X_{ij}|^2 = pn$. Thus
by $\|X\|_2 \le \|X\|_F \le \sqrt{p}\,\|X\|_2$, we obtain

$$\frac{1}{p}\|X\|_F^2 \le \|X\|_2^2 \le \|X\|_F^2, \tag{7.7}$$

which implies that $1 \le \Lambda_{\max}(\frac{1}{n}X^TX) \le p$. Since we allow $p$ to grow with $n$, (7.6) need not hold.
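Both ends of the range $1 \le \Lambda_{\max}(\frac{1}{n}X^TX) \le p$ are attained, which is why no constant $C_{\max}$ can be assumed without further structure on $X$. The two toy designs below are hypothetical extremes (orthogonal columns and identical columns), each with squared column norm n.

```python
def scaled_gram(X):
    # (1/n) X^T X for a matrix stored as a list of n rows
    n = len(X)
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) / n for cj in cols] for ci in cols]

n = p = 4

# Orthogonal columns, each with squared norm n: (1/n) X^T X is the identity,
# so the largest eigenvalue is 1.
X_orth = [[(n ** 0.5 if i == j else 0.0) for j in range(p)] for i in range(n)]
G_orth = scaled_gram(X_orth)

# Identical columns of ones, each with squared norm n: (1/n) X^T X is the
# all-ones matrix, whose largest eigenvalue is p (eigenvector: all ones).
X_same = [[1.0] * p for _ in range(n)]
G_same = scaled_gram(X_same)
G_same_times_ones = [sum(row) for row in G_same]

print(G_orth)
print(G_same_times_ones)
```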



Finally we note that the conditions on $\lambda_m$ in the Gaussian Ensemble result of Wainwright (2006)
are (3.13a) and a slight variation of (3.13b):

$$\frac{1}{\rho_m}\left\{\sqrt{\frac{\log s}{m}} + \lambda_m\right\} \to 0; \tag{7.8}$$

hence if we further assume that $\big\|(\frac{1}{n}X_S^TX_S)^{-1}\big\|_\infty \le D_{\max}$ for some constant $D_{\max} < +\infty$, as
required by Wainwright (2006) on $\big\|\Sigma_{SS}^{-1}\big\|_\infty$, (3.13b) and (7.8) are equivalent.



Hence by imposing the S-incoherence condition on a deterministic $X_{n \times p}$ with all columns of
$X$ having $\ell_2$-norm $\sqrt{n}$, when $m$ satisfies the lower bound in (3.12), rather than (2.6) with a threshold
constant $\theta_u$ that depends on $C_{\max}$ as in (7.6), we have shown that the probability of sparsity recovery
through lasso approaches one, given that $\lambda_m$ satisfies (3.13), when the design matrix is a Gaussian
Ensemble generated through $\Phi X$ with $\Phi_{m \times n}$ having independent $\Phi_{i,j} \sim N(0, 1/n)$, $\forall i, j$. We do not
have a comparable result for the failure of recovery given (2.7).
B. S-Incoherence

We first state some generally useful results about matrix norms.

Theorem 7.1. (Horn and Johnson (1990), p. 301) If $\|\cdot\|$ is a matrix norm and $\|A\| < 1$, then
$I + A$ is invertible and

$$(I + A)^{-1} = \sum_{k=0}^{\infty} (-A)^k. \tag{7.9}$$


Proposition 7.2. If the matrix norm $\|\cdot\|$ has the property that $\|I\| = 1$, and if $A \in M_n$ is such
that $\|A\| < 1$, we have

$$\frac{1}{1 + \|A\|} \le \left\|(I + A)^{-1}\right\| \le \frac{1}{1 - \|A\|}. \tag{7.10}$$

Proof. The upper bound follows from Theorem 7.1 and the triangle inequality:

$$\left\|(I + A)^{-1}\right\| = \Big\|\sum_{k=0}^{\infty}(-A)^k\Big\| \le \sum_{k=0}^{\infty}\|{-A}\|^k = \sum_{k=0}^{\infty}\|A\|^k = \frac{1}{1 - \|A\|}. \tag{7.11}$$

The lower bound follows from the general inequality $\|B^{-1}\| \ge \frac{1}{\|B\|}$, given that $\|I\| \le \|B\|\,\|B^{-1}\|$, and
the triangle inequality $\|A + I\| \le \|A\| + \|I\| = \|A\| + 1$:

$$\left\|(A + I)^{-1}\right\| \ge \frac{1}{\|A + I\|} \ge \frac{1}{1 + \|A\|}. \tag{7.12}$$

□
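The two-sided bound (7.10) is easy to check on a concrete matrix. The sketch below uses the $\ell_\infty$ operator norm (maximum absolute row sum), which satisfies $\|I\| = 1$; the 2 x 2 matrix `A` is a hypothetical example with $\|A\|_\infty = 0.5$.

```python
def inf_norm(M):
    # Induced l_infinity norm: maximum absolute row sum
    return max(sum(abs(v) for v in row) for row in M)

def inverse_2x2(M):
    # Closed-form inverse of a 2 x 2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[0.2, -0.3], [0.1, 0.4]]              # hypothetical, ||A||_inf = 0.5 < 1
I_plus_A = [[1 + A[0][0], A[0][1]], [A[1][0], 1 + A[1][1]]]

norm_A = inf_norm(A)
norm_inv = inf_norm(inverse_2x2(I_plus_A))
lower = 1 / (1 + norm_A)
upper = 1 / (1 - norm_A)
print(lower, norm_inv, upper)
```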



Let us define the following symmetric matrices, which we use throughout the rest of this section:

\begin{align}
A &= \frac{1}{n}X_S^TX_S - I_{|S|}, \tag{7.13a} \\
\tilde A &= \frac{1}{m}(\Phi X)_S^T(\Phi X)_S - I_s = \frac{1}{m}Z_S^TZ_S - I_s. \tag{7.13b}
\end{align}

We next show the following consequence of the S-Incoherence condition.


Proposition 7.3. Let $X$ be an $n \times p$ matrix that satisfies the S-Incoherence condition. Then for the
symmetric matrix $A$ in (7.13a), we have $\|A\|_\infty = \|A\|_1 \le 1 - \eta$, for some $\eta \in (0, 1]$, and

$$\|A\|_2 \le \sqrt{\|A\|_\infty\|A\|_1} \le 1 - \eta, \tag{7.14}$$

and hence $\Lambda_{\min}(\frac{1}{n}X_S^TX_S) \ge \eta$, i.e., the S-Incoherence condition implies condition (7.1b).

Proof. Given that $\|A\|_2 < 1$, $\|I\|_2 = 1$, and by Proposition 7.2,

$$\Lambda_{\min}\Big(\frac{1}{n}X_S^TX_S\Big) = \frac{1}{\big\|(\frac{1}{n}X_S^TX_S)^{-1}\big\|_2} = \frac{1}{\left\|(I + A)^{-1}\right\|_2} \ge 1 - \|A\|_2 \ge \eta. \tag{7.15}$$

□



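Proposition 7.4. The S-Incoherence condition on an $n \times p$ matrix $X$ implies conditions (7.1a)
and (7.1b).

Proof. It remains to show (7.1a) given Proposition 7.3. Now suppose that the incoherence
condition holds for some $\eta \in (0, 1]$, i.e., $\big\|\frac{1}{n}X_{S^c}^TX_S\big\|_\infty + \|A\|_\infty \le 1 - \eta$; we must have

$$\frac{\big\|\frac{1}{n}X_{S^c}^TX_S\big\|_\infty}{1 - \|A\|_\infty} \le 1 - \eta, \tag{7.16}$$

given that $\big\|\frac{1}{n}X_{S^c}^TX_S\big\|_\infty + \|A\|_\infty(1 - \eta) \le 1 - \eta$ and $1 - \|A\|_\infty \ge \eta > 0$.
Next observe that, given $\|A\|_\infty < 1$, by Proposition 7.2,

$$\Big\|\Big(\frac{1}{n}X_S^TX_S\Big)^{-1}\Big\|_\infty = \left\|(I + A)^{-1}\right\|_\infty \le \frac{1}{1 - \|A\|_\infty}. \tag{7.17}$$

Finally, we have

\begin{align}
\left\|X_{S^c}^TX_S(X_S^TX_S)^{-1}\right\|_\infty &\le \Big\|\frac{1}{n}X_{S^c}^TX_S\Big\|_\infty \Big\|\Big(\frac{1}{n}X_S^TX_S\Big)^{-1}\Big\|_\infty \tag{7.18a} \\
&\le \frac{\big\|\frac{1}{n}X_{S^c}^TX_S\big\|_\infty}{1 - \|A\|_\infty} \le 1 - \eta. \tag{7.18b}
\end{align}

□

C. Proof of Lemma 3.5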



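Let $\Phi_{ij} = \frac{1}{\sqrt{n}}g_{ij}$, where $g_{ij}$, $\forall i = 1, \ldots, m$, $j = 1, \ldots, n$ are independent $N(0, 1)$ random
variables. We define

$$Y_\ell := \sum_{k=1}^n\sum_{j=1}^n g_{\ell,k}\,g_{\ell,j}\,x_k y_j, \tag{7.19}$$

and we thus have the following:

\begin{align}
\frac{n}{m}\langle\Phi x, \Phi y\rangle &= \frac{1}{m}\sum_{\ell=1}^m\sum_{k=1}^n\sum_{j=1}^n g_{\ell,k}\,g_{\ell,j}\,x_k y_j \tag{7.20a} \\
&= \frac{1}{m}\sum_{\ell=1}^m Y_\ell, \tag{7.20b}
\end{align}

where $Y_\ell$, $\forall\ell$, are independent random variables, and

\begin{align}
E(Y_\ell) &= E\Big(\sum_{k=1}^n\sum_{j=1}^n g_{\ell,k}\,g_{\ell,j}\,x_k y_j\Big) \tag{7.21a} \\
&= \sum_{k=1}^n x_k y_k\,E(g_{\ell,k}^2) \tag{7.21b} \\
&= \langle x, y\rangle. \tag{7.21c}
\end{align}

Let us define a set of zero-mean independent random variables $Z_1, \ldots, Z_m$,

$$Z_\ell := Y_\ell - \langle x, y\rangle = Y_\ell - E(Y_\ell), \tag{7.22}$$

such that

\begin{align}
\frac{n}{m}\langle\Phi x, \Phi y\rangle - \langle x, y\rangle &= \frac{1}{m}\sum_{\ell=1}^m Y_\ell - \langle x, y\rangle \tag{7.23a} \\
&= \frac{1}{m}\sum_{\ell=1}^m\big(Y_\ell - \langle x, y\rangle\big) \tag{7.23b} \\
&= \frac{1}{m}\sum_{\ell=1}^m Z_\ell. \tag{7.23c}
\end{align}

In the following, we analyze the integrability and tail behavior of $Z_\ell$, $\forall\ell$, which is known as
"Gaussian chaos" of order 2.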

We first simplify notation by defining $Y := \sum_{k=1}^n\sum_{j=1}^n g_k g_j x_k y_j$, where $g_k$, $g_j$ are independent
$N(0, 1)$ variates, and $Z$,

$$Z := Y - E(Y) = \sum_{k=1}^n\sum_{j=1, j \ne k}^n g_k g_j x_k y_j + \sum_{k=1}^n (g_k^2 - 1)x_k y_k, \tag{7.24}$$

where $E(Z) = 0$. Applying a general bound of Ledoux and Talagrand (1991) for Gaussian chaos
gives that

$$E(|Z|^q) \le (q - 1)^q\big(E(|Z|^2)\big)^{q/2} \tag{7.25}$$

for all $q > 2$.

The following claim is based on (7.25); its proof appears in Rauhut et al. (2007) and is omitted here.



Claim 7.5. (Rauhut et al. (2007)) Let $M = e\,(E(|Z|^2))^{1/2}$ and $s = \frac{2e}{\sqrt{6\pi}}E(|Z|^2)$. Then

$$\forall q > 2, \quad E(|Z|^q) \le q!\,M^{q-2}s/2.$$

Clearly the above claim holds for $q = 2$, since trivially $E(|Z|^q) \le q!\,M^{q-2}s/2$ given that for
$q = 2$

\begin{align}
q!\,M^{q-2}s/2 &= 2M^{2-2}s/2 = s \tag{7.26a} \\
&= \frac{2e}{\sqrt{6\pi}}E(|Z|^2) \approx 1.2522\,E(|Z|^2). \tag{7.26b}
\end{align}


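Finally, let us determine $E(|Z|^2)$. The two sums in (7.24) are uncorrelated, and for $k \ne j$ the product
$g_k g_j x_k y_j$ is correlated only with itself and with $g_j g_k x_j y_k$, so that

\begin{align}
E(|Z|^2) &= E\Big(\Big(\sum_{k=1}^n\sum_{j=1, j \ne k}^n g_k g_j x_k y_j + \sum_{k=1}^n (g_k^2 - 1)x_k y_k\Big)^2\Big) \tag{7.27a} \\
&= \sum_{k \ne j} E(g_k^2)E(g_j^2)\big(x_k^2 y_j^2 + x_k y_k x_j y_j\big) + \sum_{k=1}^n E\big((g_k^2 - 1)^2\big)x_k^2 y_k^2 \tag{7.27b} \\
&= \sum_{k \ne j}\big(x_k^2 y_j^2 + x_k y_k x_j y_j\big) + 2\sum_{k=1}^n x_k^2 y_k^2 \tag{7.27c} \\
&\le 2\|x\|_2^2\|y\|_2^2 \tag{7.27d} \\
&\le 2, \tag{7.27e}
\end{align}

given that $\|x\|_2, \|y\|_2 \le 1$; here (7.27c) equals $\|x\|_2^2\|y\|_2^2 + \langle x, y\rangle^2$, and (7.27d) follows from the
Cauchy-Schwarz inequality.

Thus for independent random variables $Z_i$, $\forall i = 1, \ldots, m$, we have

$$E(|Z_i|^q) \le q!\,M^{q-2}\sigma_i^2/2, \tag{7.28}$$

where $M = e\,(E(|Z|^2))^{1/2} \le e\sqrt{2}$ and $\sigma_i^2 = \frac{2e}{\sqrt{6\pi}}E(|Z|^2) \le \frac{4e}{\sqrt{6\pi}} \le 2.5044$, $\forall i$.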
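The second moment in (7.27) can be cross-checked without simulation. Writing $Y = g^TMg$ with $M = xy^T$, the standard identity $\mathrm{Var}(g^TMg) = 2\,\mathrm{tr}(M_s^2)$ for a standard Gaussian vector $g$, where $M_s = (M + M^T)/2$, gives $E(|Z|^2) = \|x\|_2^2\|y\|_2^2 + \langle x, y\rangle^2$. The sketch below evaluates both sides for a hypothetical pair of unit vectors; the function name is illustrative.

```python
def chaos_second_moment(x, y):
    # Variance of g^T (x y^T) g for standard Gaussian g, computed as
    # 2 * tr(Ms^2), where Ms is the symmetric part of x y^T. For a
    # symmetric matrix, tr(Ms^2) is the sum of its squared entries.
    n = len(x)
    Ms = [[0.5 * (x[k] * y[j] + x[j] * y[k]) for j in range(n)] for k in range(n)]
    return 2.0 * sum(Ms[k][j] ** 2 for k in range(n) for j in range(n))

x = [0.6, 0.8, 0.0]   # hypothetical unit vector
y = [0.0, 0.6, 0.8]   # hypothetical unit vector

exact = chaos_second_moment(x, y)
norm_x2 = sum(v * v for v in x)
norm_y2 = sum(v * v for v in y)
inner = sum(a * b for a, b in zip(x, y))
closed_form = norm_x2 * norm_y2 + inner ** 2
print(exact, closed_form)
```

For unit vectors the value lies between 1 and 2, consistent with the bound (7.27e).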



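Finally, we apply the following theorem, the proof of which follows arguments from Bennett
(1962):

Theorem 7.6. (Bennett Inequality (Bennett, 1962)) Let $Z_1, \ldots, Z_m$ be independent random
variables with zero mean such that

$$E(|Z_i|^q) \le q!\,M^{q-2}v_i/2, \tag{7.29}$$

for every $q \ge 2$ and some constants $M$ and $v_i$, $\forall i = 1, \ldots, m$. Then for $\tau > 0$,

$$P\Big(\Big|\sum_{i=1}^m Z_i\Big| \ge \tau\Big) \le 2\exp\Big(-\frac{\tau^2}{2v + 2M\tau}\Big), \tag{7.30}$$

with $v = \sum_{i=1}^m v_i$.

We can then apply the Bennett Inequality to obtain the following:

\begin{align}
P\Big(\Big|\frac{n}{m}\langle\Phi x, \Phi y\rangle - \langle x, y\rangle\Big| \ge \tau\Big) &= P\Big(\Big|\frac{1}{m}\sum_{\ell=1}^m Z_\ell\Big| \ge \tau\Big) \tag{7.31a} \\
&= P\Big(\Big|\sum_{\ell=1}^m Z_\ell\Big| \ge m\tau\Big) \tag{7.31b} \\
&\le 2\exp\Big(-\frac{(m\tau)^2}{2\sum_{i=1}^m\sigma_i^2 + 2Mm\tau}\Big) \tag{7.31c} \\
&= 2\exp\Big(-\frac{m\tau^2}{\frac{2}{m}\sum_{i=1}^m\sigma_i^2 + 2M\tau}\Big) \tag{7.31d} \\
&\le 2\exp\Big(-\frac{m\tau^2}{C_1 + C_2\tau}\Big), \tag{7.31e}
\end{align}

with $C_1 = \frac{4e}{\sqrt{6\pi}} \approx 2.5044$ and $C_2 = \sqrt{8}\,e \approx 7.6885$. □

D. Proof of Proposition 3.6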



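We use Lemma 3.5, except that we now have to consider the change in absolute row sums of
$\big\|\frac{1}{n}X_{S^c}^TX_S\big\|_\infty$ and $\|A\|_\infty$ after multiplication by $\Phi$. We first prove the following claim.

Claim 7.7. Let $X$ be a deterministic matrix that satisfies the incoherence condition. If

$$\Big|\frac{1}{m}\langle\Phi X_i, \Phi X_j\rangle - \frac{1}{n}\langle X_i, X_j\rangle\Big| \le \tau, \tag{7.32}$$

for any two columns $X_i$, $X_j$ of $X$ that are involved in (3.21b), then

$$\Big\|\frac{1}{m}(\Phi X)_{S^c}^T(\Phi X)_S\Big\|_\infty + \|\tilde A\|_\infty \le 1 - \eta + 2s\tau, \tag{7.33}$$

and

$$\Lambda_{\min}\Big(\frac{1}{m}Z_S^TZ_S\Big) \ge \eta - s\tau. \tag{7.34}$$

Proof. It is straightforward to show (7.33). Since each row in $\frac{1}{m}(\Phi X)_{S^c}^T(\Phi X)_S$ and $\tilde A$ has $s$
entries, where each entry changes by at most $\tau$ compared to those in $\frac{1}{n}X^TX$, the absolute sum of
any row can change by at most $s\tau$,

\begin{align}
\Big|\,\Big\|\frac{1}{m}(\Phi X)_{S^c}^T(\Phi X)_S\Big\|_\infty - \Big\|\frac{1}{n}X_{S^c}^TX_S\Big\|_\infty\Big| &\le s\tau, \tag{7.35a} \\
\Big|\,\|\tilde A\|_\infty - \|A\|_\infty\Big| &\le s\tau, \tag{7.35b}
\end{align}

and hence

\begin{align}
\Big\|\frac{1}{m}(\Phi X)_{S^c}^T(\Phi X)_S\Big\|_\infty + \|\tilde A\|_\infty &\le \Big\|\frac{1}{n}X_{S^c}^TX_S\Big\|_\infty + \|A\|_\infty + 2s\tau \tag{7.36a} \\
&\le 1 - \eta + 2s\tau. \tag{7.36b}
\end{align}

We now prove (7.34). Defining $E = \tilde A - A$, we have

$$\|E\|_2 \le s\max_{i,j}|\tilde A_{i,j} - A_{i,j}| \le s\tau, \tag{7.37}$$

given that each entry of $\tilde A$ deviates from that of $A$ by at most $\tau$. Thus we have that

\begin{align}
\|\tilde A\|_2 &= \|A + E\|_2 \tag{7.38a} \\
&\le \|A\|_2 + \|E\|_2 \tag{7.38b} \\
&\le \|A\|_2 + s\max_{i,j}|E_{i,j}| \tag{7.38c} \\
&\le 1 - \eta + s\tau, \tag{7.38d}
\end{align}

where $\|A\|_2 \le 1 - \eta$ is due to Proposition 7.3.

Given that $\|I\|_2 = 1$ and $\|\tilde A\|_2 < 1$, by Proposition 7.2,

\begin{align}
\Lambda_{\min}\Big(\frac{1}{m}Z_S^TZ_S\Big) &= \frac{1}{\big\|(\frac{1}{m}Z_S^TZ_S)^{-1}\big\|_2} \tag{7.39a} \\
&= \frac{1}{\big\|(I + \tilde A)^{-1}\big\|_2} \tag{7.39b} \\
&\ge 1 - \|\tilde A\|_2 \tag{7.39c} \\
&\ge \eta - s\tau. \tag{7.39d}
\end{align}

□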



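We let $\mathcal{E}$ represent the union of the following events, where $\tau = \frac{\eta}{4s}$:

1. $\exists i \in S$, $j \in S^c$, such that $\big|\frac{1}{m}\langle\Phi X_i, \Phi X_j\rangle - \frac{1}{n}\langle X_i, X_j\rangle\big| \ge \tau$,

2. $\exists i, i' \in S$, such that $\big|\frac{1}{m}\langle\Phi X_i, \Phi X_{i'}\rangle - \frac{1}{n}\langle X_i, X_{i'}\rangle\big| \ge \tau$,

3. $\exists j \in S^c$, such that

$$\Big|\frac{1}{m}\|\Phi X_j\|_2^2 - \frac{1}{n}\|X_j\|_2^2\Big| > \tau. \tag{7.40}$$

Consider first the implication of $\mathcal{E}^c$, i.e., when none of the events in $\mathcal{E}$ happens. We immediately
have that (3.21b), (7.34) and (3.22b) all simultaneously hold by Claim 7.7; and (3.21b) implies
that the incoherence condition is satisfied for $Z = \Phi X$ by Proposition 7.4.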

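We first bound the probability of a single event counted in $\mathcal{E}$. Consider two column vectors
$x = \frac{X_i}{\sqrt{n}}$, $y = \frac{X_j}{\sqrt{n}} \in \mathbb{R}^n$ in the matrix $\frac{X}{\sqrt{n}}$; we have $\|x\|_2 = 1$, $\|y\|_2 = 1$, and

\begin{align}
& P\Big(\Big|\frac{1}{m}\langle\Phi X_i, \Phi X_j\rangle - \frac{1}{n}\langle X_i, X_j\rangle\Big| \ge \tau\Big) \tag{7.41a} \\
&\quad = P\Big(\Big|\frac{n}{m}\langle\Phi x, \Phi y\rangle - \langle x, y\rangle\Big| \ge \tau\Big) \le 2\exp\Big(-\frac{m\tau^2}{C_1 + C_2\tau}\Big) \tag{7.41b} \\
&\quad \le 2\exp\Big(-\frac{m\eta^2/16s^2}{C_1 + C_2\eta/4s}\Big), \tag{7.41c}
\end{align}

given that $\tau = \frac{\eta}{4s}$.

We can now bound the probability that any such large-deviation event happens. Recall that $p$ is the
total number of columns of $X$ and $s = |S|$; the total number of events in $\mathcal{E}$ is less than $p(s + 1)$.
Thus

\begin{align}
P(\mathcal{E}) &\le p(s + 1)\,P\Big(\Big|\frac{1}{m}\langle\Phi X_i, \Phi X_j\rangle - \frac{1}{n}\langle X_i, X_j\rangle\Big| \ge \frac{\eta}{4s}\Big) \tag{7.42a} \\
&\le 2p(s + 1)\exp\Big(-\frac{m\eta^2/16s^2}{C_1 + C_2\eta/4s}\Big) \tag{7.42b} \\
&\le 2p(s + 1)\exp\big(-(\ln p + c\ln n + \ln 2(s + 1))\big) = \frac{1}{n^c}, \tag{7.42c}
\end{align}

given that $m \ge \big(\frac{16C_1s^2}{\eta^2} + \frac{4C_2s}{\eta}\big)(\ln p + c\ln n + \ln 2(s + 1))$. □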
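The arithmetic behind (7.42b) and (7.42c) can be traced with a few lines of code. The sketch below evaluates the lower bound on $m$ and confirms that plugging it into (7.42b) gives exactly $n^{-c}$; the function names and the parameter values (p, n, s, $\eta$, c) are hypothetical choices for illustration.

```python
import math

C1 = 4 * math.e / math.sqrt(6 * math.pi)   # approximately 2.5044
C2 = math.sqrt(8) * math.e                 # approximately 7.6885

def min_projections(p, n, s, eta, c):
    # Lower bound on m: (16 C1 s^2 / eta^2 + 4 C2 s / eta)(ln p + c ln n + ln 2(s+1))
    factor = 16 * C1 * s ** 2 / eta ** 2 + 4 * C2 * s / eta
    return factor * (math.log(p) + c * math.log(n) + math.log(2 * (s + 1)))

def failure_bound(m, p, s, eta):
    # Right-hand side of (7.42b) with tau = eta / (4 s)
    tau = eta / (4 * s)
    return 2 * p * (s + 1) * math.exp(-m * tau ** 2 / (C1 + C2 * tau))

p, n, s, eta, c = 128, 9000, 3, 0.5, 2.0   # hypothetical values
m_min = min_projections(p, n, s, eta, c)
bound_at_m_min = failure_bound(m_min, p, s, eta)
print("m_min =", round(m_min, 1))
print("bound =", bound_at_m_min, "n^-c =", n ** (-c))
```

Because the failure bound is decreasing in $m$, any larger $m$ gives a probability below $n^{-c}$. The constants are conservative, so the resulting $m$ can be large for moderate $s$ and $\eta$.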
E. Proof of Theorem 3.7

We first show that each of the diagonal entries of $\Phi\Phi^T$ is close to its expected value.

We begin by stating a deviation bound for the $\chi^2_n$ distribution in Lemma 7.8 and its corollary,
from which we will eventually derive a bound on $|R_{i,j}|$. Recall that the random variable $Q \sim \chi^2_n$
is distributed according to the chi-square distribution if $Q = \sum_{i=1}^n Y_i^2$ with $Y_i \sim N(0, 1)$ that are
independent and normally distributed.



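Lemma 7.8. (Johnstone (2001))

\begin{align}
P\Big(\frac{\chi^2_n}{n} - 1 \le -\epsilon\Big) &\le \exp\Big(\frac{-n\epsilon^2}{4}\Big), \quad \text{for } 0 \le \epsilon \le 1, \tag{7.43a} \\
P\Big(\frac{\chi^2_n}{n} - 1 \ge \epsilon\Big) &\le \exp\Big(\frac{-3n\epsilon^2}{16}\Big), \quad \text{for } 0 \le \epsilon < \frac{1}{2}. \tag{7.43b}
\end{align}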
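A small simulation gives a feel for how loose the two tail bounds in Lemma 7.8 are. This is only a sketch: the values of n, $\epsilon$, the number of trials and the seed are arbitrary choices, and the empirical frequencies are Monte Carlo estimates rather than exact probabilities.

```python
import math
import random

def chi2_tail_bounds(n, eps):
    # Bounds (7.43a) and (7.43b) for the lower and upper tails
    return math.exp(-n * eps ** 2 / 4), math.exp(-3 * n * eps ** 2 / 16)

rng = random.Random(1)
n, eps, trials = 50, 0.3, 4000
lower_hits = 0
upper_hits = 0
for _ in range(trials):
    # One draw of chi^2_n / n as a mean of n squared standard normals
    q = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
    if q - 1 <= -eps:
        lower_hits += 1
    if q - 1 >= eps:
        upper_hits += 1

lower_bound, upper_bound = chi2_tail_bounds(n, eps)
print("lower tail:", lower_hits / trials, "bound:", lower_bound)
print("upper tail:", upper_hits / trials, "bound:", upper_bound)
```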



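Corollary 7.9. (Deviation Bound for Diagonal Entries of $\Phi\Phi^T$) Given a set of independent
normally distributed random variables $X_1, \ldots, X_n \sim N(0, \sigma_X^2)$, for $0 \le \epsilon < \frac{\sigma_X^2}{2}$,

$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i^2 - \sigma_X^2\Big| > \epsilon\Big) \le \exp\Big(\frac{-n\epsilon^2}{4\sigma_X^4}\Big) + \exp\Big(\frac{-3n\epsilon^2}{16\sigma_X^4}\Big). \tag{7.44}$$

Proof. Given that $X_1, \ldots, X_n \sim N(0, \sigma_X^2)$, we have $\frac{X_i}{\sigma_X} \sim N(0, 1)$, and

$$\sum_{i=1}^n\Big(\frac{X_i}{\sigma_X}\Big)^2 \sim \chi^2_n. \tag{7.45}$$

Thus by Lemma 7.8, we obtain the following:

\begin{align}
P\Big(\frac{1}{n}\sum_{i=1}^n\frac{X_i^2}{\sigma_X^2} - 1 \le -\epsilon\Big) &\le \exp\Big(\frac{-n\epsilon^2}{4}\Big), \quad 0 \le \epsilon \le 1, \tag{7.46a} \\
P\Big(\frac{1}{n}\sum_{i=1}^n\frac{X_i^2}{\sigma_X^2} - 1 \ge \epsilon\Big) &\le \exp\Big(\frac{-3n\epsilon^2}{16}\Big), \quad 0 \le \epsilon < \frac{1}{2}. \tag{7.46b}
\end{align}

Therefore we have the following by a union bound, for $\epsilon < \frac{\sigma_X^2}{2}$,

\begin{align}
P\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i^2 - \sigma_X^2\Big| > \epsilon\Big) &= P\Big(\sigma_X^2\Big|\frac{1}{n}\sum_{i=1}^n\frac{X_i^2}{\sigma_X^2} - 1\Big| > \epsilon\Big) \tag{7.47a} \\
&= P\Big(\Big|\frac{\chi^2_n}{n} - 1\Big| > \frac{\epsilon}{\sigma_X^2}\Big) \tag{7.47b} \\
&\le P\Big(\frac{\chi^2_n}{n} - 1 < -\frac{\epsilon}{\sigma_X^2}\Big) + P\Big(\frac{\chi^2_n}{n} - 1 > \frac{\epsilon}{\sigma_X^2}\Big) \tag{7.47c} \\
&\le \exp\Big(\frac{-n\epsilon^2}{4\sigma_X^4}\Big) + \exp\Big(\frac{-3n\epsilon^2}{16\sigma_X^4}\Big). \tag{7.47d}
\end{align}

□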



We next show that the non-diagonal entries of $\Phi\Phi^T$ are close to zero, their expected value.

Lemma 7.10. (Johnstone (2001)) Given independent random variables $X_1, \ldots, X_n$, where $X_1 =
z_1z_2$, with $z_1$ and $z_2$ being independent $N(0, 1)$ variables,

$$P\Big(\frac{1}{n}\sum_{i=1}^n X_i > \sqrt{\frac{b\log n}{n}}\Big) \le Cn^{-3b/2}. \tag{7.48}$$


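Corollary 7.11. (Deviation Bound for Non-Diagonal Entries of $\Phi\Phi^T$) Given a collection of
i.i.d. random variables $Y_1, \ldots, Y_n$, where $Y_i = x_1x_2$ is a product of two independent normal
random variables $x_1, x_2 \sim N(0, \sigma_X^2)$, we have

$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^n Y_i\Big| > \sigma_X^2\sqrt{\frac{A\log n}{n}}\Big) \le 2Cn^{-3A/2}. \tag{7.49}$$

Proof. First, we let

$$X_i = \frac{Y_i}{\sigma_X^2} = \frac{x_1}{\sigma_X}\cdot\frac{x_2}{\sigma_X}. \tag{7.50}$$

By Lemma 7.10, symmetry of the events $\big\{\frac{1}{n}\sum_{i=1}^n X_i < -\sqrt{\frac{b\log n}{n}}\big\}$ and $\big\{\frac{1}{n}\sum_{i=1}^n X_i > \sqrt{\frac{b\log n}{n}}\big\}$,
and a union bound, we have

$$P\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i\Big| > \sqrt{\frac{b\log n}{n}}\Big) \le 2Cn^{-3b/2}. \tag{7.51}$$

Thus we have the following

\begin{align}
P\Big(\Big|\frac{1}{n}\sum_{i=1}^n Y_i\Big| > \sigma_X^2\sqrt{\frac{b\log n}{n}}\Big) &= P\Big(\Big|\frac{1}{n}\sum_{i=1}^n\frac{Y_i}{\sigma_X^2}\Big| > \sqrt{\frac{b\log n}{n}}\Big) \tag{7.52a} \\
&\le 2Cn^{-3b/2}, \tag{7.52b}
\end{align}

and thus the statement in the Corollary. □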



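We are now ready to put things together. By letting each entry of $\Phi_{m \times n}$ be i.i.d. $N(0, \frac{1}{n})$, we
have for each diagonal entry $D = \sum_{i=1}^n X_i^2$, where $X_i \sim N(0, \frac{1}{n})$,

$$E(D) = 1, \tag{7.53}$$

and

\begin{align}
P\Big(\Big|\sum_{i=1}^n X_i^2 - 1\Big| > \sqrt{\frac{b\log n}{n}}\Big) &= P\Big(\Big|\frac{1}{n}\sum_{i=1}^n X_i^2 - \frac{1}{n}\Big| > \sqrt{\frac{b\log n}{n^3}}\Big) \tag{7.54a} \\
&\le n^{-b/4} + n^{-3b/16}, \tag{7.54b}
\end{align}

where the last inequality is obtained by plugging in $\epsilon = \sqrt{\frac{b\log n}{n^3}}$ and $\sigma_X^2 = \frac{1}{n}$ in (7.44).

For a non-diagonal entry $W = \sum_{i=1}^n Y_i$, where $Y_i = x_1x_2$ with independent $x_1, x_2 \sim N(0, \frac{1}{n})$, we
have

$$E(W) = 0, \tag{7.55}$$

and

$$P\Big(\Big|\sum_{i=1}^n Y_i\Big| > \sqrt{\frac{b\log n}{n}}\Big) \le 2Cn^{-3b/2}, \tag{7.56}$$

by plugging in $\sigma_X^2 = \frac{1}{n}$ in (7.52a) directly.

Finally, we apply a union bound, where $b = 2$ for non-diagonal entries and $b = 16$ for diagonal
entries in the following:

\begin{align}
P\Big(\exists i, j, \text{ s.t. } |R_{i,j}| > \sqrt{\frac{b\log n}{n}}\Big) &\le 2C(m^2 - m)n^{-3} + mn^{-4} + mn^{-3} \tag{7.57a} \\
&= O\big(m^2n^{-3}\big) = O\Big(\frac{1}{n^2\log n}\Big), \tag{7.57b}
\end{align}

given that $m^2 \le \frac{n}{b\log n}$ for $b = 2$. □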
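The concentration of $\Phi\Phi^T$ around the identity can be seen directly in a small simulation. In the sketch below, `R` is taken to be $\Phi\Phi^T - I$, which is an assumption about the definition of $R$ since it is stated elsewhere in the paper; the sizes m and n and the seed are arbitrary.

```python
import math
import random

rng = random.Random(2)
m, n = 5, 2000

# Phi has independent N(0, 1/n) entries
Phi = [[rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)] for _ in range(m)]

# R = Phi Phi^T - I (assumed definition of the deviation matrix)
R = [[sum(a * b for a, b in zip(Phi[i], Phi[j])) - (1.0 if i == j else 0.0)
      for j in range(m)] for i in range(m)]

max_diag = max(abs(R[i][i]) for i in range(m))
max_offdiag = max(abs(R[i][j]) for i in range(m) for j in range(m) if i != j)

# Thresholds sqrt(b log n / n) with b = 16 (diagonal) and b = 2 (off-diagonal)
thr_diag = math.sqrt(16 * math.log(n) / n)
thr_offdiag = math.sqrt(2 * math.log(n) / n)
print("max |R_ii| =", max_diag, "threshold:", thr_diag)
print("max |R_ij| =", max_offdiag, "threshold:", thr_offdiag)
```

Each diagonal entry of $\Phi\Phi^T$ is a sum of n squared $N(0, 1/n)$ variables and each off-diagonal entry is a sum of n products, matching the quantities $D$ and $W$ above.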
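F. Proof of Lemma

Recall that $Z = \tilde X = \Phi X$, $W = \tilde Y = \Phi Y$, and $\omega = \tilde\epsilon = \Phi\epsilon$, and we observe $W = Z\beta^* + \omega$.

First observe that the KKT conditions imply that $\tilde\beta \in \mathbb{R}^p$ is optimal, i.e., $\tilde\beta \in \tilde\Omega_m$ for $\tilde\Omega_m$ as defined
in (3.5), if and only if there exists a subgradient

$$\tilde z \in \partial\|\tilde\beta\|_1 = \big\{z \in \mathbb{R}^p \mid z_i = \mathrm{sgn}(\tilde\beta_i) \text{ for } \tilde\beta_i \ne 0, \text{ and } |z_j| \le 1 \text{ otherwise}\big\} \tag{7.58}$$

such that

$$\frac{1}{m}Z^TZ\tilde\beta - \frac{1}{m}Z^TW + \lambda_m\tilde z = 0, \tag{7.59}$$

which is equivalent to the following linear system by substituting $W = Z\beta^* + \omega$ and re-arranging,

$$\frac{1}{m}Z^TZ(\tilde\beta - \beta^*) - \frac{1}{m}Z^T\omega + \lambda_m\tilde z = 0. \tag{7.60}$$

Hence, given $Z$, $\beta^*$, $\omega$ and $\lambda_m > 0$, the event $\mathcal{E}(\mathrm{sgn}(\tilde\beta_m) = \mathrm{sgn}(\beta^*))$ holds if and only if

1. there exist a point $\tilde\beta \in \mathbb{R}^p$ and a subgradient $\tilde z \in \partial\|\tilde\beta\|_1$ such that (7.60) holds, and

2. $\mathrm{sgn}(\tilde\beta_S) = \mathrm{sgn}(\beta^*_S)$ and $\tilde\beta_{S^c} = \beta^*_{S^c} = 0$, which implies that $\tilde z_S = \mathrm{sgn}(\beta^*_S)$ and $|\tilde z_{S^c}| \le 1$ by
definition of $\tilde z$.

Plugging $\tilde\beta_{S^c} = \beta^*_{S^c} = 0$ and $\tilde z_S = \mathrm{sgn}(\beta^*_S)$ in (7.60) allows us to claim that the event

$$\mathcal{E}(\mathrm{sgn}(\tilde\beta_m) = \mathrm{sgn}(\beta^*)) \tag{7.61}$$

holds if and only if

1. there exists a point $\tilde\beta \in \mathbb{R}^p$ and a subgradient $\tilde z \in \partial\|\tilde\beta\|_1$ such that the following two sets of
equations hold:

\begin{align}
\frac{1}{m}Z_{S^c}^TZ_S(\tilde\beta_S - \beta^*_S) - \frac{1}{m}Z_{S^c}^T\omega &= -\lambda_m\tilde z_{S^c}, \tag{7.62a} \\
\frac{1}{m}Z_S^TZ_S(\tilde\beta_S - \beta^*_S) - \frac{1}{m}Z_S^T\omega &= -\lambda_m\tilde z_S = -\lambda_m\mathrm{sgn}(\beta^*_S), \tag{7.62b}
\end{align}

2. $\mathrm{sgn}(\tilde\beta_S) = \mathrm{sgn}(\beta^*_S)$ and $\tilde\beta_{S^c} = \beta^*_{S^c} = 0$.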
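Using invertibility of $Z_S^TZ_S$, we can solve for $\tilde\beta_S$ and $\tilde z_{S^c}$ using (7.62a) and (7.62b) to obtain

\begin{align}
-\lambda_m\tilde z_{S^c} &= Z_{S^c}^TZ_S(Z_S^TZ_S)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big] - \frac{1}{m}Z_{S^c}^T\omega, \tag{7.63a} \\
\tilde\beta_S &= \beta^*_S + \Big(\frac{1}{m}Z_S^TZ_S\Big)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big]. \tag{7.63b}
\end{align}

Thus, given invertibility of $Z_S^TZ_S$, the event $\mathcal{E}(\mathrm{sgn}(\tilde\beta_m) = \mathrm{sgn}(\beta^*))$ holds if and only if

1. there exists simultaneously a point $\tilde\beta \in \mathbb{R}^p$ and a subgradient $\tilde z \in \partial\|\tilde\beta\|_1$ such that the
following two sets of equations hold:

\begin{align}
-\lambda_m\tilde z_{S^c} &= Z_{S^c}^TZ_S(Z_S^TZ_S)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big] - \frac{1}{m}Z_{S^c}^T\omega, \tag{7.64a} \\
\tilde\beta_S &= \beta^*_S + \Big(\frac{1}{m}Z_S^TZ_S\Big)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big], \tag{7.64b}
\end{align}

2. $\mathrm{sgn}(\tilde\beta_S) = \mathrm{sgn}(\beta^*_S)$ and $\tilde\beta_{S^c} = \beta^*_{S^c} = 0$.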

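The last set of necessary and sufficient conditions for the event $\mathcal{E}(\mathrm{sgn}(\tilde\beta_m) = \mathrm{sgn}(\beta^*))$ to hold
implies that there exists simultaneously a point $\tilde\beta \in \mathbb{R}^p$ and a subgradient $\tilde z \in \partial\|\tilde\beta\|_1$ such that

\begin{align}
\Big|Z_{S^c}^TZ_S(Z_S^TZ_S)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big] - \frac{1}{m}Z_{S^c}^T\omega\Big| &= |{-\lambda_m\tilde z_{S^c}}| \le \lambda_m, \tag{7.65a} \\
\mathrm{sgn}(\tilde\beta_S) = \mathrm{sgn}\Big(\beta^*_S + \Big(\frac{1}{m}Z_S^TZ_S\Big)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big]\Big) &= \mathrm{sgn}(\beta^*_S), \tag{7.65b}
\end{align}

given that $|\tilde z_{S^c}| \le 1$ by definition of $\tilde z$. Thus (3.25a) and (3.25b) hold for the given $Z$, $\beta^*$, $\omega$ and
$\lambda_m > 0$. Thus we have shown the lemma in one direction.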



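For the reverse direction, given $Z$, $\beta^*$, $\omega$, and supposing that (3.25a) and (3.25b) hold for some
$\lambda_m > 0$, we first construct a point $\tilde\beta \in \mathbb{R}^p$ by letting $\tilde\beta_{S^c} = \beta^*_{S^c} = 0$ and

$$\tilde\beta_S = \beta^*_S + \Big(\frac{1}{m}Z_S^TZ_S\Big)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big], \tag{7.66}$$

which guarantees that

$$\mathrm{sgn}(\tilde\beta_S) = \mathrm{sgn}\Big(\beta^*_S + \Big(\frac{1}{m}Z_S^TZ_S\Big)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big]\Big) = \mathrm{sgn}(\beta^*_S) \tag{7.67}$$

by (3.25b). We simultaneously construct $\tilde z$ by letting $\tilde z_S = \mathrm{sgn}(\tilde\beta_S) = \mathrm{sgn}(\beta^*_S)$ and

$$\tilde z_{S^c} = -\frac{1}{\lambda_m}\Big(Z_{S^c}^TZ_S(Z_S^TZ_S)^{-1}\Big[\frac{1}{m}Z_S^T\omega - \lambda_m\mathrm{sgn}(\beta^*_S)\Big] - \frac{1}{m}Z_{S^c}^T\omega\Big), \tag{7.68}$$

which guarantees that $|\tilde z_{S^c}| \le 1$ due to (3.25a); hence $\tilde z \in \partial\|\tilde\beta\|_1$. Thus we have found a
point $\tilde\beta \in \mathbb{R}^p$ and a subgradient $\tilde z \in \partial\|\tilde\beta\|_1$ such that $\mathrm{sgn}(\tilde\beta) = \mathrm{sgn}(\beta^*)$ and the set of
equations (7.64a) and (7.64b) is satisfied. Hence, assuming the invertibility of $Z_S^TZ_S$, the event
$\mathcal{E}(\mathrm{sgn}(\tilde\beta_m) = \mathrm{sgn}(\beta^*))$ holds for the given $Z$, $\beta^*$, $\omega$, $\lambda_m$. □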
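The stationarity condition (7.59) can be checked numerically on a lasso solution. The sketch below is not the solver used in the paper: it is a plain coordinate descent written for illustration, run on a hypothetical Gaussian design standing in for $Z$, with hypothetical values for $\beta^*$, the noise level and $\lambda_m$. It then measures how far $\frac{1}{m}Z^T(W - Z\tilde\beta)$ is from $\lambda_m\tilde z$ for a valid subgradient $\tilde z$.

```python
import random

def soft_threshold(v, t):
    # Soft-thresholding operator S(v, t)
    return (v - t) if v > t else ((v + t) if v < -t else 0.0)

def lasso_cd(Z, W, lam, sweeps=500):
    # Coordinate descent for (1/2m)||W - Z b||^2 + lam * ||b||_1
    m, p = len(Z), len(Z[0])
    beta = [0.0] * p
    resid = list(W)  # current residual W - Z beta
    col_sq = [sum(Z[i][j] ** 2 for i in range(m)) / m for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            rho = sum(Z[i][j] * resid[i] for i in range(m)) / m + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != beta[j]:
                delta = new - beta[j]
                for i in range(m):
                    resid[i] -= Z[i][j] * delta
                beta[j] = new
    return beta

def kkt_violation(Z, W, beta, lam):
    # Largest deviation from (1/m) Z^T (W - Z beta) = lam * z, z in the subdifferential
    m, p = len(Z), len(Z[0])
    resid = [W[i] - sum(Z[i][j] * beta[j] for j in range(p)) for i in range(m)]
    worst = 0.0
    for j in range(p):
        g = sum(Z[i][j] * resid[i] for i in range(m)) / m
        if beta[j] != 0.0:
            viol = abs(g - lam * (1.0 if beta[j] > 0 else -1.0))
        else:
            viol = max(0.0, abs(g) - lam)
        worst = max(worst, viol)
    return worst

rng = random.Random(3)
m, p = 40, 6
beta_star = [1.5, -2.0, 0.0, 0.0, 0.0, 0.0]   # hypothetical sparse truth
Z = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]
W = [sum(z * b for z, b in zip(row, beta_star)) + rng.gauss(0.0, 0.1) for row in Z]
lam = 0.1

beta_hat = lasso_cd(Z, W, lam)
violation = kkt_violation(Z, W, beta_hat, lam)
print("beta_hat =", [round(b, 3) for b in beta_hat])
print("KKT violation =", violation)
```

On the support the gradient must equal $\lambda_m\,\mathrm{sgn}(\tilde\beta_j)$, and off the support it must lie within $[-\lambda_m, \lambda_m]$; these are the two cases handled in `kkt_violation`.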



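G. Proof of Lemma 3.10

Given that $\frac{1}{m}Z_S^TZ_S = \tilde A + I_s$, we bound $\big\|(\frac{1}{m}Z_S^TZ_S)^{-1}\big\|_\infty$ through $\big\|(\tilde A + I_s)^{-1}\big\|_\infty$.
First we have for $m \ge \big(\frac{16C_1s^2}{\eta^2} + \frac{4C_2s}{\eta}\big)(\ln p + c\ln n + \ln 2(s + 1))$,

$$\|\tilde A\|_\infty \le \|A\|_\infty + \frac{\eta}{4} \le 1 - \eta + \eta/4 = 1 - 3\eta/4, \tag{7.69}$$

where $\eta \in (0, 1]$, due to (3.10) and (3.21a). Hence, given that $\|I\|_\infty = 1$ and $\|\tilde A\|_\infty < 1$, by
Proposition 7.2,

$$\Big\|\Big(\frac{1}{m}Z_S^TZ_S\Big)^{-1}\Big\|_\infty = \big\|(\tilde A + I_s)^{-1}\big\|_\infty \le \frac{1}{1 - \|\tilde A\|_\infty} \le \frac{4}{3\eta}. \tag{7.70}$$

Similarly, given $\|A\|_\infty < 1$, we have

$$\frac{1}{1 + \|A\|_\infty} \le \Big\|\Big(\frac{1}{n}X_S^TX_S\Big)^{-1}\Big\|_\infty = \big\|(A + I_s)^{-1}\big\|_\infty \le \frac{1}{1 - \|A\|_\infty}. \tag{7.71}$$

Given that $\frac{\lambda_m}{\rho_m}\big\|(\frac{1}{n}X_S^TX_S)^{-1}\big\|_\infty \to 0$, we have $\frac{\lambda_m}{\rho_m}\frac{1}{1 + \|A\|_\infty} \to 0$, and thus

\begin{align}
\frac{\lambda_m}{\rho_m}\Big\|\Big(\frac{1}{m}Z_S^TZ_S\Big)^{-1}\Big\|_\infty &\le \frac{\lambda_m}{\rho_m}\frac{1}{1 - \|\tilde A\|_\infty} \tag{7.72a} \\
&= \frac{\lambda_m}{\rho_m}\frac{1}{1 + \|A\|_\infty}\cdot\frac{1 + \|A\|_\infty}{1 - \|\tilde A\|_\infty} \tag{7.72b} \\
&\le \frac{\lambda_m}{\rho_m}\frac{1}{1 + \|A\|_\infty}\Big(\frac{4(2 - \eta)}{3\eta}\Big) \to 0, \tag{7.72c}
\end{align}

by (7.70) and the fact that by (3.10), $1 + \|A\|_\infty \le 2 - \eta$. □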

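H. Proof of Claim 3.11

We first prove the following.

Claim 7.12. If $m$ satisfies (3.12), then $\frac{1}{m}\max_{i,j}(B_{i,j}) \le 1 + \frac{\eta}{4s}$.

Proof. Let us denote the $i$-th column in $Z_S$ with $Z_{S,i}$. Let $x = Z_{S,i}$ and $y = Z_{S,j}$ be $m \times 1$
vectors. By Proposition 3.6, $\|x\|_2^2, \|y\|_2^2 \le m\big(1 + \frac{\eta}{4s}\big)$. We then have, in terms of $x$ and $y$,

\begin{align}
B_{i,j} &= \sum_{k=1}^m\sum_{l=1}^m x_k y_l R_{k,l} \le \sum_{k=1}^m\sum_{l=1}^m |x_k||y_l||R_{k,l}| \tag{7.73a} \\
&\le \max_{k,l}|R_{k,l}|\sum_{k=1}^m\sum_{l=1}^m |x_k||y_l| = \max_{k,l}|R_{k,l}|\Big(\sum_{k=1}^m |x_k|\Big)\Big(\sum_{l=1}^m |y_l|\Big) \tag{7.73b} \\
&\le \max_{k,l}|R_{k,l}|\,m\,\|x\|_2\|y\|_2 \le \max_{k,l}|R_{k,l}|\,m^2\Big(1 + \frac{\eta}{4s}\Big). \tag{7.73c}
\end{align}

Thus the claim follows given that $\max_{k,l}|R_{k,l}| \le 4\sqrt{\frac{\log n}{n}}$ and $4m \le \sqrt{\frac{n}{\log n}}$. □

Finally, to finish the proof of Claim 3.11 we have

\begin{align}
\max_i M_{i,i} &= \max_i\frac{C_i^TBC_i}{m} = \frac{1}{m}\max_i C_i^TBC_i = \frac{1}{m}\max_i\sum_{j=1}^s\sum_{k=1}^s C_{i,j}C_{i,k}B_{j,k} \tag{7.74a} \\
&\le \frac{1}{m}\max_{j,k}|B_{j,k}|\,\max_i\Big(\sum_{j=1}^s |C_{i,j}|\sum_{k=1}^s |C_{i,k}|\Big) \tag{7.74b} \\
&\le \Big(1 + \frac{\eta}{4s}\Big)\Big(\max_i\sum_{j=1}^s |C_{i,j}|\Big)^2 \le \Big(1 + \frac{\eta}{4s}\Big)\|C\|_\infty^2, \tag{7.74c}
\end{align}

where $\|C\|_\infty = \big\|(\frac{1}{m}Z_S^TZ_S)^{-1}\big\|_\infty \le \frac{4}{3\eta}$ as in (7.70) for $m \ge \big(\frac{16C_1s^2}{\eta^2} + \frac{4C_2s}{\eta}\big)(\ln p + c\ln n +
\ln 2(s + 1))$. □

Remark 7.13. In fact, $\max_{i,j}M_{i,j} = \max_i M_{i,i}$.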



VIII. Discussion 

The results presented here suggest several directions for future work. Most immediately, our current
sparsity analysis holds for compression using random linear transformations. However, compression
with a random affine mapping $X \mapsto \Phi X + \Delta$ may have stronger privacy properties; we
expect that our sparsity results can be extended to this case. While we have studied data compression
by random projection of columns of $X$ to low dimensions, one also would like to consider
projection of the rows, reducing $p$ to a smaller number of effective variables. However, simulations
suggest that the strong sparsity recovery properties of $\ell_1$ regularization are not preserved
under projection of the rows.

It would be natural to investigate the effectiveness of other statistical learning techniques under
compression of the data. For instance, logistic regression with $\ell_1$-regularization has recently been
shown to be effective in isolating relevant variables in high dimensional classification problems
(Wainwright et al., 2007); we expect that compressed logistic regression can be shown to have
similar theoretical guarantees to those shown in the current paper. It would also be interesting
to extend this methodology to nonparametric methods. As one possibility, the rodeo is an approach
to sparse nonparametric regression that is based on thresholding derivatives of an estimator
(Lafferty and Wasserman, 2007). Since the rodeo is based on kernel evaluations, and Euclidean
distances are approximately preserved under random projection, this nonparametric procedure may
still be effective under compression.

The formulation of privacy in Section 5 is, arguably, weaker than the cryptographic-style guarantees
sought through, for example, differential privacy (Dwork, 2006). In particular, our analysis in
terms of average mutual information may not preclude the recovery of detailed data about a small
number of individuals. For instance, suppose that a column $X_j$ of $X$ is very sparse, with all but
a few entries zero. Then the results of compressed sensing (Candès et al., 2006) imply that, given
knowledge of the compression matrix $\Phi$, this column can be approximately recovered by solving
the compressed sensing linear program

\begin{align}
\min \quad & \|X_j\|_1 \tag{8.1a} \\
\text{such that} \quad & Z_j = \Phi X_j. \tag{8.1b}
\end{align}


However, crucially, this requires knowledge of the compression matrix $\Phi$; our privacy protocol
requires that this matrix is not known to the receiver. Moreover, this requires that the column is
sparse; such a column cannot have a large impact on the predictive accuracy of the regression
estimate. If a sparse column is removed, the resulting predictions should be nearly as accurate as
those from an estimator constructed with the full data. We leave the analysis of this case as an
interesting direction for future work.